Initial Data Loading and Inspection

Import the dataset using libraries like pandas.
Perform initial inspection (head, info, describe) to understand the structure, size, and types of data.
Exploratory Data Analysis (EDA)

Data Cleaning: Handling missing values, removing duplicates.
Text Cleaning: Lowercasing, removing punctuation, removing stop words, stemming/lemmatization.
Visualization: Word clouds, frequency distributions of words or classes, etc.
Analyzing the distribution of target classes (sentiment labels) to check for imbalance.
Basic statistics: Average length of text, most common words, etc.
Feature Engineering

Text Vectorization: Converting text to numerical format using techniques like Bag-of-Words, TF-IDF.
Feature Selection: Deciding which features (words or n-grams) to include in the model.
Data Preprocessing for Modeling

Splitting the data into training and testing sets.
Further feature scaling or normalization, if necessary.
Model Building

Choosing appropriate models for sentiment analysis (e.g., logistic regression, Naive Bayes, SVM, deep learning models like LSTM).
Training the models on the training set.
Hyperparameter tuning using cross-validation.
Model Evaluation

Evaluating the models using appropriate metrics (accuracy, precision, recall, F1-score, ROC-AUC).
Confusion matrix to understand true positives, false positives, true negatives, and false negatives.
Model Interpretation

Analyzing the results to understand which features are most important for sentiment prediction.
Error analysis to understand where and why the model is making errors.
Model Deployment (Optional)

Deploying the model to a server or integrating it into an existing application.
Creating an API for the model to make predictions on new data.
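The feature engineering, model building, and tuning steps in the outline above can be chained in a single scikit-learn pipeline. The following is a minimal sketch, not the notebook's own method: the toy corpus, the choice of logistic regression, and the parameter grid are all assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; 1 = positive, 0 = negative.
texts = [
    "great phone and great battery", "love this product",
    "really good value", "works well and looks nice",
    "terrible screen", "hate the battery life",
    "really bad support", "broke after one day",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Vectorization and classification in one estimator, so the vectorizer
# is re-fitted on the training part of every cross-validation fold.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Illustrative grid only; real grids depend on the dataset.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)
best_params = search.best_params_
```

Keeping the vectorizer inside the pipeline avoids leaking vocabulary statistics from the validation folds into training.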

Understanding the Data Structure

Data Overview: Use methods like df.head(), df.info(), and df.describe() in pandas to get an overview of the dataset, understand the data types, and identify missing values.
Identifying Features: Determine which columns are features (e.g., text data) and which is the target variable (e.g., sentiment labels).
Data Cleaning

Handling Missing Values: Identify and handle missing values in your dataset. You might fill them with a placeholder value or remove the rows/columns with missing data.
Removing Duplicates: Check for and remove duplicate entries to prevent bias in the model.
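A minimal sketch of both cleaning steps with pandas. The frame `raw` and its column names `text` and `label` are hypothetical stand-ins, not the notebook's data.

```python
import pandas as pd

# Hypothetical frame with one duplicate text and two incomplete rows.
raw = pd.DataFrame({
    "text": ["good service", "good service", None, "bad food", "okay"],
    "label": [1.0, 1.0, 0.0, -1.0, None],
})

# Count missing values per column before deciding how to handle them.
missing_per_column = raw.isnull().sum()

# Drop rows with a missing text or label, then drop repeated texts.
cleaned = (
    raw.dropna(subset=["text", "label"])
       .drop_duplicates(subset="text")
       .reset_index(drop=True)
)
```

Dropping is the simplest option when only a handful of rows are affected; filling with a placeholder is the alternative mentioned above.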
Text Data Preprocessing

Normalization: Convert text to a consistent format, like lowercasing, to avoid duplication based on case differences.
Noise Removal: Remove irrelevant characters, such as punctuation, special characters, and numerical values, which might not be significant in sentiment analysis.
Tokenization: Break down the text into individual words or tokens.
Stop Words Removal: Eliminate common words that may not contribute much to the sentiment (e.g., "the", "is", "at").
Stemming/Lemmatization: Reduce words to their base or root form. For example, “running” becomes “run”.
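The first four preprocessing steps can be sketched with the standard library alone. The function name `preprocess` and the tiny `STOP_WORDS` set are hypothetical; a real project would more likely use the stop-word lists from nltk or spaCy. Stemming and lemmatization are left out here because they need one of those libraries.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "is", "at", "a", "an", "and", "this", "was"}

def preprocess(text):
    """Lowercase, strip non-letters, tokenize on whitespace, drop stop words."""
    text = text.lower()                     # normalization
    text = re.sub(r"[^a-z\s]", " ", text)   # noise removal: keep letters and whitespace
    tokens = text.split()                   # whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]

tokens = preprocess("The service at this hotel was GREAT, 10/10!!!")
```

Note that stripping everything except `a-z` also removes digits and non-English letters, which may or may not be desirable for a given dataset.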
Text Data Exploration

Word Frequency Analysis: Analyze the most common words or terms in your data. Tools like word clouds can be visually informative.
Sentiment Distribution: Check the distribution of different sentiment classes (e.g., positive, negative, neutral) to identify any class imbalances that might require addressing.
Text Length Analysis: Explore the length of the text entries and its distribution. It can be insightful to understand the dataset's verbosity and see if it correlates with sentiment.
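As a rough sketch of word frequency analysis split by sentiment class, the snippet below counts words per label with `collections.Counter`. The `samples` list and its label strings are hypothetical.

```python
from collections import Counter

# Hypothetical (text, label) pairs.
samples = [
    ("great food great service", "positive"),
    ("great view", "positive"),
    ("slow service", "negative"),
    ("slow and cold food", "negative"),
]

# One Counter per sentiment class.
word_counts = {}
for text, label in samples:
    word_counts.setdefault(label, Counter()).update(text.split())

top_positive = word_counts["positive"].most_common(1)
top_negative = word_counts["negative"].most_common(1)
```

The same per-class counters could feed a word cloud for each sentiment class.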
Advanced Analysis (Optional)

N-grams Analysis: Explore combinations of N adjacent words (bigrams, trigrams, etc.) to capture more context than single words.
Correlation Analysis: Examine relationships between text features and sentiment, if applicable.
Sentiment Over Time: If your dataset is time-based (like tweets), analyzing sentiment trends over time can be insightful.
Visualizing the Data

Use graphs like bar charts, histograms, and scatter plots to visualize distributions of text length, word frequencies, and sentiment classes.
Word clouds for the most frequent words in each sentiment class can provide a quick visual insight.
Initial Insights and Hypotheses

Based on the EDA, you can formulate hypotheses about your data. For instance, certain words might strongly indicate a particular sentiment.
Identify potential challenges like class imbalance or a limited vocabulary range that might affect model training.

2. Sentiment Distribution
Understanding the distribution of different sentiment classes in your dataset is crucial for several reasons:

Class Balance: In sentiment analysis, your dataset might have categories like positive, negative, and neutral. It's important to know if these categories are evenly represented. An imbalanced dataset can lead to biased models; for example, if most of your data is positive, the model might be less accurate in identifying negative sentiments.

Visualization: You can use bar charts or pie charts to visually represent the proportion of each sentiment class. For example, a bar chart showing the number of instances for each sentiment category (positive, negative, neutral) can quickly reveal imbalances.

Strategies for Imbalance: If you find imbalances, you might consider techniques like resampling (either oversampling the minority class or undersampling the majority class), generating synthetic samples (e.g., using SMOTE), or adjusting the class weights in model training.

Insights into Data Quality: Sometimes, the distribution of sentiments can reflect on the data collection process. For instance, a dataset scraped from a particular site might be skewed towards positive reviews due to the nature of the site.
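The class proportions and the class-weight strategy described above can be sketched as follows. The `labels` series is hypothetical and assumes a -1 / 0 / 1 coding for negative / neutral / positive. `compute_class_weight` with `"balanced"` assigns each class the weight n_samples / (n_classes * class_count).

```python
import numpy as np
import pandas as pd
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical, imbalanced labels: six positive, three neutral, one negative.
labels = pd.Series([1, 1, 1, 1, 1, 1, 0, 0, 0, -1])

# Share of each class, sorted by label value.
proportions = labels.value_counts(normalize=True).sort_index()

# "balanced" weights are inversely proportional to class frequency.
classes = np.unique(labels.to_numpy())
weights = compute_class_weight(
    class_weight="balanced", classes=classes, y=labels.to_numpy()
)
class_weight = {int(c): float(w) for c, w in zip(classes, weights)}
```

Many scikit-learn classifiers accept such a dictionary through their `class_weight` parameter, or the string `"balanced"` directly. Calling `proportions.plot(kind="bar")` would give the bar chart mentioned above.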

3. Text Length Analysis
Analyzing the length of text entries provides insights into the verbosity and the level of detail in your text data:

Length Metrics: Calculate metrics like average length, median length, shortest and longest texts. This helps in understanding the general verbosity in your dataset.

Distribution Visualization: Plotting a histogram of text lengths can be very insightful. It shows how text length varies across your dataset and might reveal patterns or outliers. For instance, you might find that most negative sentiments are expressed briefly, while positive feedback tends to be more verbose.

Correlation with Sentiment: Sometimes, the length of the text can correlate with the sentiment. Longer texts might be more likely to be detailed reviews (positive or negative), while shorter texts might be simple expressions of satisfaction or dissatisfaction.

Impact on Modeling: Knowing about text length can influence preprocessing steps (like deciding on a word limit for each input in neural networks) and can also hint at the need for different feature extraction techniques (like using n-grams to capture more context in shorter texts).
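A small sketch of the length metrics and the per-sentiment comparison, using word counts as the length measure. The `reviews` frame and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical reviews with -1 = negative, 1 = positive.
reviews = pd.DataFrame({
    "text": [
        "bad",
        "not good",
        "really enjoyed the friendly staff",
        "great place to stay",
    ],
    "label": [-1, -1, 1, 1],
})

# Length in words; use .str.len() on the text itself for characters.
reviews["n_words"] = reviews["text"].str.split().str.len()

# Mean, quartiles, min, and max in one call.
length_summary = reviews["n_words"].describe()

# Average length per sentiment class.
mean_length_by_label = reviews.groupby("label")["n_words"].mean()
```

A histogram of `n_words` per label would give the distribution view described above, and a high percentile of `n_words` is one possible choice of sequence length limit for a neural model.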

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from warnings import filterwarnings
filterwarnings('ignore')

#preprocessing

import sklearn
import nltk
import spacy
import wordcloud

In [3]:
nltk.download('stopwords')
from sklearn.feature_extraction import text
from sklearn.model_selection import train_test_split


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\modza\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [5]:
!pip install plotly

Collecting plotly
  Downloading plotly-5.18.0-py3-none-any.whl.metadata (7.0 kB)
Collecting tenacity>=6.2.0 (from plotly)
  Downloading tenacity-8.2.3-py3-none-any.whl.metadata (1.0 kB)
Downloading plotly-5.18.0-py3-none-any.whl (15.6 MB)

In [8]:
!pip install tensorflow

Collecting tensorflow
  Using cached tensorflow-2.15.0-cp311-cp311-win_amd64.whl.metadata (3.6 kB)
Collecting tensorflow-intel==2.15.0 (from tensorflow)
  Using cached tensorflow_intel-2.15.0-cp311-cp311-win_amd64.whl.metadata (5.1 kB)
Collecting absl-py>=1.0.0 (from tensorflow-intel==2.15.0->tensorflow)
  Downloading absl_py-2.1.0-py3-none-any.whl.metadata (2.3 kB)
Collecting astunparse>=1.6.0 (from tensorflow-intel==2.15.0->tensorflow)
  Using cached astunparse-1.6.3-py2.py3-none-any.whl (12 kB)
Collecting flatbuffers>=23.5.26 (from tensorflow-intel==2.15.0->tensorflow)
  Using cached flatbuffers-23.5.26-py2.py3-none-any.whl.metadata (850 bytes)
Collecting gast!=0.5.0,!=0.5.1,!=0.5.2,>=0.2.1 (from tensorflow-intel==2.15.0->tensorflow)
  Using cached gast-0.5.4-py3-none-any.whl (19 kB)
Collecting google-pasta>=0.1.1 (from tensorflow-intel==2.15.0->tensorflow)
  Using cached google_pasta-0.2.0-py3-none-any.whl (57 kB)
Collecting h5py>=2.9.0 (from tensorflow-intel==2.15.0->tensorflow)
 

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
mediapipe 0.10.9 requires protobuf<4,>=3.11, but you have protobuf 4.23.4 which is incompatible.


In [9]:
pd.options.plotting.backend ='plotly'
import tensorflow as tf




# Exploratory Data Analysis

In [7]:
df = pd.read_csv('Twitter_Data.csv')
df.head()

   clean_text                                         category
0  when modi promised “minimum government maximum...  -1.0
1  talk all the nonsense and continue all the dra...   0.0
2  what did just say vote for modi welcome bjp t...    1.0
3  asking his supporters prefix chowkidar their n...   1.0
4  answer who among these the most powerful world...   1.0


In [10]:
df.shape

(162980, 2)

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 162980 entries, 0 to 162979
Data columns (total 2 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   clean_text  162976 non-null  object 
 1   category    162973 non-null  float64
dtypes: float64(1), object(1)
memory usage: 2.5+ MB


In [14]:
df.clean_text[0]

'when modi promised “minimum government maximum governance” expected him begin the difficult job reforming the state why does take years get justice state should and not business and should exit psus and temples'

In [18]:
from pandas import json_normalize  # public entry point (pandas >= 1.0) instead of the private _normalize module

In [21]:
ad = pd.read_json('dataset_infos.json')

In [23]:
json_normalize(pd.read_json('dataset_infos.json'))

0


In [25]:
!pip install datasets

Collecting datasets
  Using cached datasets-2.16.1-py3-none-any.whl.metadata (20 kB)
Collecting pyarrow>=8.0.0 (from datasets)
  Downloading pyarrow-15.0.0-cp311-cp311-win_amd64.whl.metadata (3.1 kB)
Collecting pyarrow-hotfix (from datasets)
  Using cached pyarrow_hotfix-0.6-py3-none-any.whl.metadata (3.6 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Using cached dill-0.3.7-py3-none-any.whl.metadata (9.9 kB)
Collecting xxhash (from datasets)
  Using cached xxhash-3.4.1-cp311-cp311-win_amd64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Using cached multiprocess-0.70.15-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2023.10.0,>=2023.1.0 (from fsspec[http]<=2023.10.0,>=2023.1.0->datasets)
  Using cached fsspec-2023.10.0-py3-none-any.whl.metadata (6.8 kB)
Collecting aiohttp (from datasets)
  Using cached aiohttp-3.9.1-cp311-cp311-win_amd64.whl.metadata (7.6 kB)
Collecting multidict<7.0,>=4.5 (from aiohttp->datasets)
  Using cached multidict-6.0.4-cp311-cp31

from datasets import load_dataset

dataset = load_dataset("carblacac/twitter-sentiment-analysis")

In [28]:
df

        clean_text                                         category
0       when modi promised “minimum government maximum...  -1.0
1       talk all the nonsense and continue all the dra...   0.0
2       what did just say vote for modi welcome bjp t...    1.0
3       asking his supporters prefix chowkidar their n...   1.0
4       answer who among these the most powerful world...   1.0
...     ...                                                 ...
162975  why these 456 crores paid neerav modi not reco...  -1.0
162976  dear rss terrorist payal gawar what about modi...  -1.0
162977  did you cover her interaction forum where she ...   0.0
162978  there big project came into india modi dream p...   0.0


In [29]:
df.isnull().sum()

clean_text    4
category      7
dtype: int64

In [30]:
df.dropna(inplace = True)

In [32]:
X = df['clean_text']

y = df['category']


X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=7)


In [33]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [41]:
tfidf = TfidfVectorizer(stop_words='english',max_df = 0.75)
tfidf

In [36]:
from spacy.lang.en import stop_words

In [38]:
from spacy.lang.en.stop_words import STOP_WORDS      

In [40]:
stop_words.STOP_WORDS

{"'d",
 "'ll",
 "'m",
 "'re",
 "'s",
 "'ve",
 'a',
 'about',
 'above',
 'across',
 'after',
 'afterwards',
 'again',
 'against',
 'all',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'amount',
 'an',
 'and',
 'another',
 'any',
 'anyhow',
 'anyone',
 'anything',
 'anyway',
 'anywhere',
 'are',
 'around',
 'as',
 'at',
 'back',
 'be',
 'became',
 'because',
 'become',
 'becomes',
 'becoming',
 'been',
 'before',
 'beforehand',
 'behind',
 'being',
 'below',
 'beside',
 'besides',
 'between',
 'beyond',
 'both',
 'bottom',
 'but',
 'by',
 'ca',
 'call',
 'can',
 'cannot',
 'could',
 'did',
 'do',
 'does',
 'doing',
 'done',
 'down',
 'due',
 'during',
 'each',
 'eight',
 'either',
 'eleven',
 'else',
 'elsewhere',
 'empty',
 'enough',
 'even',
 'ever',
 'every',
 'everyone',
 'everything',
 'everywhere',
 'except',
 'few',
 'fifteen',
 'fifty',
 'first',
 'five',
 'for',
 'former',
 'formerly',
 'forty',
 'four',
 'from',
 'fron

In [42]:
tfidf.fit(X_train)

tfidf.get_feature_names_out().tolist()

In [47]:
Xtrain_dtm = tfidf.transform(X_train)
Xtrain_dtm

<122226x89235 sparse matrix of type '<class 'numpy.float64'>'
	with 1421153 stored elements in Compressed Sparse Row format>

In [48]:
Xtest_dtm = tfidf.transform(X_test)

## Model Building

In [50]:
from sklearn import naive_bayes

In [None]:
naive_bayes.MultinomialNB

In [51]:
#from sklearn.naive_bayes.
from sklearn import metrics

In [52]:
nb = naive_bayes.MultinomialNB()
nb

In [53]:
nb.fit(Xtrain_dtm,y_train)

In [55]:
nb.predict(Xtest_dtm)

array([0., 1., 0., ..., 1., 1., 1.])

In [56]:
metrics.accuracy_score(y_test, nb.predict(Xtest_dtm))

0.5694475124561275
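Accuracy alone does not show which classes the model confuses. The per-class metrics and the confusion matrix listed in the outline can be obtained as sketched below. The arrays `y_true` and `y_pred` are hypothetical toy values, not results from this notebook.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true and predicted labels (-1 negative, 0 neutral, 1 positive).
y_true = np.array([-1, -1, 0, 0, 1, 1, 1, 1])
y_pred = np.array([1, -1, 0, 1, 1, 1, 1, 0])

labels = [-1, 0, 1]

# Rows are true classes, columns are predicted classes, in the order of `labels`.
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Precision, recall, and F1 per class as a printable string.
report = classification_report(y_true, y_pred, labels=labels, zero_division=0)
print(report)
```

In this notebook the same calls would take `y_test` and `nb.predict(Xtest_dtm)` as arguments.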

In [59]:
tfidf.get_feature_names_out()[0:200]

array(['000', '0000', '00000', '000000', '0000000', '00000000',
       '000000000', '0000000000', '0000000001', '0001', '0005',
       '000kwkajhatka', '000s', '000waiting', '001', '001to', '002',
       '007', '007james', '00905523223324', '00s', '00xxxxxxxxx',
       '0100am', '0100kph', '0104also', '0118', '015', '018sec',
       '01surgical', '02antisatellite', '032019i', '0327', '0339', '034',
       '0351', '03it', '03jalgaon', '03x', '0404health', '041', '0414',
       '0448pm', '046', '048', '04yrs', '0500', '05042019', '0554',
       '05852234246', '05yrs', '06042019', '0607', '0609', '0700',
       '070319', '070801z', '07102001', '074223z', '07989964299', '080',
       '0800', '080916', '081', '081116', '081127z', '089', '08th', '090',
       '0914', '092807z', '0930', '0945', '09555560725', '09999150812',
       '09th', '0pposed', '0seats', '0the', '100', '1000', '10000',
       '100000', '1000000', '10000000', '1000000000', '1000000000000000',
       '100000000000000000000

Relationship between Bag of Words and Vectorization Methods
Bag of Words (BoW) and vectorization methods in Natural Language Processing (NLP) are closely related but are not exactly the same.

BoW is a specific method of representing text data as a collection of words, disregarding grammar and word order but keeping multiplicity (the frequency of each word).

Vectorization is the general process of converting text into numerical feature vectors. This involves several steps, including tokenization, counting, and normalization.

Count Vectorization is the most direct implementation of BoW: each document becomes a vector of word counts whose length equals the size of the vocabulary learned from the whole dataset. Most entries are zero, so the result is usually stored as a sparse matrix.

TF-IDF (Term Frequency-Inverse Document Frequency) Vectorization is another common form of vectorization. Unlike simple count vectorization, TF-IDF scales each count by how rare the term is across the dataset, so terms that appear in many documents receive less weight than terms concentrated in a few.

Sentiment Analysis Project Steps
For a sentiment analysis project using a dataset from a CSV file, the steps include:

Initial Data Loading and Inspection
Exploratory Data Analysis (EDA): Data Cleaning, Text Cleaning, Visualization, Basic Statistics
Feature Engineering: Text Vectorization using techniques like Bag-of-Words, TF-IDF
Data Preprocessing for Modeling: Data Splitting, Feature Scaling
Model Building: Model Selection, Training, Hyperparameter Tuning
Model Evaluation: Accuracy, Precision, Recall, F1-Score, ROC-AUC
Model Interpretation
Model Deployment (optional)
Reporting and Documentation
Improvement and Iteration

EDA in Depth for Sentiment Analysis

Sentiment Distribution: Analyze class balance (positive, negative, neutral) using visualizations like bar charts or pie charts.
Text Length Analysis: Metrics such as average length, distribution visualization, correlation with sentiment, impact on modeling.

Feature Selection in Text Data

Frequency-Based Selection: Select features based on word frequency.
Mutual Information and Chi-Squared: Measure association between each feature and the target variable.
TF-IDF: Weigh words based on their importance in the document and across the corpus.
Machine Learning for Feature Selection: Use models like L1-regularized logistic regression for feature selection.

Example of Feature Selection using TF-IDF and Chi-Squared
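from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Toy corpus with binary sentiment labels
texts = ["I love this movie", "I hate this movie", "This is a great movie", "This is a bad movie"]
labels = [1, 0, 1, 0]  # 1 for positive, 0 for negative

# Build TF-IDF features over unigrams and bigrams
tfidf = TfidfVectorizer(ngram_range=(1, 2))
features = tfidf.fit_transform(texts)

# Keep the 2 features with the highest chi-squared score against the labels
chi2_selector = SelectKBest(chi2, k=2)
X_kbest = chi2_selector.fit_transform(features, labels)

# Show which terms were kept
print(tfidf.get_feature_names_out()[chi2_selector.get_support()])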

from manim import *
from IPython.display import Video

class ForLoopAnimation(Scene):
    def construct(self):
        # Create a list to iterate over
        elements = [1, 2, 3, 4, 5]
        element_texts = [Text(str(element)) for element in elements]

        # Position the elements horizontally so they do not overlap at the origin
        VGroup(*element_texts).arrange(RIGHT, buff=0.5)
        self.play(*[Write(element_text) for element_text in element_texts])
        self.wait(1)
        for element_text in element_texts:
            # Highlight the current element
            self.play(element_text.animate.set_color(YELLOW))
            self.wait(1)
            # Return to the original color
            self.play(element_text.animate.set_color(WHITE))
            self.wait(0.5)

        self.wait(2)

# Render the animation
# Note: We use lower quality (-ql) for faster rendering in Colab
!manim -pql -o for_loop_animation.mp4 for_loop_animation.py ForLoopAnimation

# Display the rendered video
Video("for_loop_animation.mp4")
