
# Our journey from analysis to prediction includes:

* **Understanding the data**
* **Data Preprocessing and Cleaning**
* **Exploratory data analysis (EDA)**
* **Data Modeling**
* **Model Evaluation**

> ##   Importing Libraries
> * pandas: A library for data manipulation and analysis, offering powerful data structures like DataFrames for handling structured data efficiently.
> * numpy: A library for numerical computing, providing support for arrays, matrices, and high-level mathematical operations.
> * matplotlib.pyplot: A plotting library used for creating static, interactive, and animated visualizations.
> * seaborn: A data visualization library built on top of Matplotlib, offering a high-level interface for creating attractive and informative statistical graphics.
> * scipy.stats: A module within SciPy for performing statistical tests, probability distributions, and other statistical functions.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from IPython.display import display

> ## Reading the Dataset with Pandas

> **Let's start by reading the data using pandas. There are two important points to consider to avoid errors while reading the data.**
>
> 1. First, set the encoding to **'latin1'**, because pandas' default UTF-8 decoding fails on this file with a character encoding error. So, **use: encoding='latin1'**.
>
> 2. Second, since our data doesn't have column names, we need to specify the header as None. Otherwise, pandas will treat the first row as the header and use its values as column names. So, **use: header=None**.

In [4]:
data = pd.read_csv(r"training.1600000.processed.noemoticon.csv", encoding="latin1", header=None)

ParserError: Error tokenizing data. C error: EOF inside string starting at row 28219
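
To see what the two options do in isolation, here is a minimal sketch on an in-memory sample rather than the real file. The bytes and the two sample rows are made up for illustration; only `pd.read_csv`, `encoding`, and `header` come from the cell above.

```python
import io
import pandas as pd

# Hypothetical two-row sample; "\xe9" is "é" encoded as latin1 and is not valid UTF-8 here.
raw = b'0,1,"caf\xe9 was closed"\n4,2,"great day"\n'

# The default UTF-8 decoding is expected to fail on the latin1 byte.
try:
    pd.read_csv(io.BytesIO(raw), header=None)
    utf8_failed = False
except UnicodeError:
    utf8_failed = True

# header=None keeps the first row as data and numbers the columns 0..n-1.
with_options = pd.read_csv(io.BytesIO(raw), encoding="latin1", header=None)

# Without header=None, the first row is consumed as column names.
without_header = pd.read_csv(io.BytesIO(raw), encoding="latin1")

print(utf8_failed)
print(with_options.shape, list(with_options.columns))
print(without_header.shape)
```

The second read keeps both rows, while the third loses one row to the header.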

# 1. Understanding the Data
**Let’s take a closer look at our dataset to better understand its structure and content.**

> **Let’s use the shape attribute from pandas. It provides the number of rows and columns in a DataFrame, helping us understand the dataset’s structure.
We found that the dataset consists of 1,600,000 rows and 6 columns.**

In [4]:
data.shape

(1600000, 6)

> **Let’s also use the head() function from pandas. It gives us a quick look at the first few rows of the dataset. By default, it shows the first 5 rows, but you can specify a different number by passing an argument, such as head(10) to display the first 10 rows.**

In [5]:
data.head()

| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | _TheSpecialOne_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all.... |


> **After displaying the data, you'll notice that the columns have been assigned the names 0 to 5. We now rename them with meaningful names that reflect their content.**

In [6]:
columns = ["sentiment", "ids", "date", "flag", "user", "tweet"]
data.columns= columns
data.head()

| | sentiment | ids | date | flag | user | tweet |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | _TheSpecialOne_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all.... |


> **Let’s display information about the data using info(). It covers what the shape and dtypes attributes report (the number of rows and columns and the data type of each column) and adds the non-null count per column, which reveals whether any column contains missing values, along with the memory usage.**

In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 6 columns):
 #   Column     Non-Null Count    Dtype 
---  ------     --------------    ----- 
 0   sentiment  1600000 non-null  int64 
 1   ids        1600000 non-null  int64 
 2   date       1600000 non-null  object
 3   flag       1600000 non-null  object
 4   user       1600000 non-null  object
 5   tweet      1600000 non-null  object
dtypes: int64(2), object(4)
memory usage: 73.2+ MB
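
`info()` folds the null check into its Non-Null Count column. A more direct check is `isna().sum()`, sketched here on a small made-up frame (the column names mirror the ones above, the rows are hypothetical):

```python
import pandas as pd

# Hypothetical three-row frame with one missing tweet.
toy = pd.DataFrame({
    "sentiment": [0, 4, 4],
    "tweet": ["so tired", None, "great day"],
})

# Count missing values per column.
null_counts = toy.isna().sum()

print(toy.shape)
print(toy.dtypes)
print(null_counts)
```

On the real dataset every column shows 1600000 non-null entries, so this count would be zero for each column.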


# 2. Data Preprocessing and Cleaning
**Now, let's move on to data preprocessing and cleaning to ensure our dataset is well-structured and ready for analysis and modeling.**

> First, we need to focus on the most relevant features in our dataset: **'tweet'** and **'sentiment'**. The remaining columns are not needed for this task.
>
> * **'tweet'** is the independent feature **(input)** that we'll use to predict sentiment.
> * **'sentiment'** is the target variable **(output)** that represents the sentiment classification.

In [9]:
df = data[["tweet", "sentiment"]].copy()
df.head()

| | tweet | sentiment |
|---|---|---|
| 0 | @switchfoot http://twitpic.com/2y1zl - Awww, t... | 0 |
| 1 | is upset that he can't update his Facebook by ... | 0 |
| 2 | @Kenichan I dived many times for the ball. Man... | 0 |
| 3 | my whole body feels itchy and like its on fire | 0 |
| 4 | @nationwideclass no, it's not behaving at all.... | 0 |


## Cleaning Tweets:
**To improve the quality of our dataset, we need to clean the tweets by removing unnecessary elements like links, special characters, and stopwords. Let's go through the function step by step:**

> **1. Importing Required Libraries:** We first import the necessary libraries for text processing:
>
> * **re:** For regular expressions, which help in text cleaning.
> * **nltk:** A powerful library for natural language processing (NLP).
> * **stopwords from nltk.corpus:** A list of common words (e.g., "the", "is", "and") that don't add much meaning.
> * **PorterStemmer:** A stemming algorithm that reduces words to their root form (e.g., "running" → "run").
> * **STOPWORDS from wordcloud**: Another set of common words. It is imported here, but the cleaning function below only uses the NLTK list.
>
> **2. Downloading and Defining Stopwords:**
> * We download the stopwords list using nltk.download('stopwords').
> * We create a set of stopwords for faster lookups.
> * We add extra stopwords: "amp", "rt", "lt", and "gt" (often seen in tweets but aren't meaningful).
>
> **3. The clean_tweet Function:**
> * **tweet = tweet.lower()** >>> Converting to Lowercase
> * **tweet = re.sub(r"https?://\S+", "", tweet)** >>> Removing URLs
> * **tweet = re.sub(r"@\w+|#", "", tweet)** >>> Removing mentions (@username) and the `#` symbol of hashtags while keeping the word itself (#happy → happy)
> * **tweet = re.sub(r"[^\w\s]|[\d]", "", tweet)** >>> Removing Special Characters and Numbers
> * **tweet = " ".join([stemmer.stem(word) for word in tweet.split() if word not in stop_words])** >>> Tokenization, Stopword Removal, and Stemming
>
> **Final Cleaned Tweet Output:**
> * After processing, a tweet like:
📌 **"RT @user: Check this out! https://www.kaggle.com/mohamedhelmyali #amazing"**
> * Will be cleaned to:
✅ **"check amaz"** (the stopwords "rt", "this", and "out" are dropped, and the stemmer reduces "amazing" to "amaz")
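
The regex steps listed above can be traced one at a time. The following is a minimal sketch using only the standard library: `toy_stop_words` is a hypothetical three-word stand-in for the NLTK list, and the stemming step is left out to avoid the NLTK dependency.

```python
import re

# Hypothetical stopword set for illustration; the notebook uses NLTK's English list plus extras.
toy_stop_words = {"rt", "this", "out"}

tweet = "RT @user: Check this out! https://www.kaggle.com/mohamedhelmyali #amazing"

step1 = tweet.lower()                          # lowercase
step2 = re.sub(r"https?://\S+", "", step1)     # drop URLs
step3 = re.sub(r"@\w+|#", "", step2)           # drop mentions and the '#' symbol
step4 = re.sub(r"[^\w\s]|[\d]", "", step3)     # drop punctuation and digits
tokens = [w for w in step4.split() if w not in toy_stop_words]

for label, value in [("lower", step1), ("urls", step2), ("mentions", step3), ("symbols", step4)]:
    print(f"{label:>8}: {value!r}")
print(tokens)
```

Printing each intermediate string makes it easy to spot which step removes a given piece of the tweet. In the notebook's function, each surviving token is additionally passed through `PorterStemmer`.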

In [10]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from wordcloud import STOPWORDS

nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
stop_words.update(["amp", "rt", "lt", "gt"])

stemmer = PorterStemmer()

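def clean_tweet(tweet):
    # Lowercase so that stopword matching is case-insensitive
    tweet = tweet.lower()
    # Remove URLs
    tweet = re.sub(r"https?://\S+", "", tweet)
    # Remove mentions (@username) and the '#' symbol, keeping the hashtag word
    tweet = re.sub(r"@\w+|#", "", tweet)
    # Remove punctuation and digits
    tweet = re.sub(r"[^\w\s]|[\d]", "", tweet)
    # Split on whitespace, drop stopwords, and stem the remaining words
    tweet = " ".join([stemmer.stem(word) for word in tweet.split() if word not in stop_words])
    return tweet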

df["clean_tweet"] = df["tweet"].apply(clean_tweet)
df.head()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


| | tweet | sentiment | clean_tweet |
|---|---|---|---|
| 0 | @switchfoot http://twitpic.com/2y1zl - Awww, t... | 0 | awww that bummer shoulda got david carr third day |
| 1 | is upset that he can't update his Facebook by ... | 0 | upset cant updat facebook text might cri resul... |
| 2 | @Kenichan I dived many times for the ball. Man... | 0 | dive mani time ball manag save rest go bound |
| 3 | my whole body feels itchy and like its on fire | 0 | whole bodi feel itchi like fire |
| 4 | @nationwideclass no, it's not behaving at all.... | 0 | behav im mad cant see |


# 3. Exploratory Data Analysis - EDA
**After cleaning the data, the next step is to understand its characteristics using statistical analysis and visualizations.**

> **First, we'll convert the sentiment labels to their original meanings for better readability:**
>
> * 0 → Negative 😠
> * 4 → Positive 😊

In [None]:
df["sentiment"] = df["sentiment"].replace({0: "Negative", 4: "Positive"})
df.head()

> 1. **Sentiment Distribution**
> * Objective: Determine whether the dataset is balanced or biased toward a particular sentiment.
> * We'll visualize the count of each sentiment class using a pie chart.

In [None]:
import matplotlib.pyplot as plt

# Fix the class order so that each color is tied to the intended label
counts = df["sentiment"].value_counts().reindex(["Negative", "Positive"])

counts.plot(kind="pie",
            autopct='%1.1f%%',
            pctdistance=0.85,
            startangle=90,
            colors=["lightcoral", "lightgreen"],
            wedgeprops={'edgecolor': 'black'})

plt.title('Sentiment Distribution')
plt.ylabel('')  # hide the default axis label added by pandas
plt.axis('equal')
plt.legend(labels=counts.index, loc='upper right', fontsize=9)
plt.show()
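
As a numeric complement to the chart, the class balance can also be read off directly. This is a small sketch on a made-up label series (the five values are hypothetical), using `value_counts` with and without `normalize=True`:

```python
import pandas as pd

# Hypothetical label column.
toy = pd.Series(["Negative", "Positive", "Negative", "Positive", "Positive"], name="sentiment")

counts = toy.value_counts()                 # absolute counts per class
shares = toy.value_counts(normalize=True)   # fraction of rows per class

print(counts)
print(shares.round(2))
```

Shares close to 0.5 for each class indicate a balanced dataset.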


> **2. Most Common Words in Each Sentiment (Word Clouds)**
> * **Objective:** Identify words commonly associated with positive and negative sentiments.
> * **A word cloud** is a great way to see the most frequent words in each sentiment category.
> * This helps us understand what kind of words contribute to positive or negative emotions.

> **Let's start by generating the word cloud for positive sentiment.
In this step, we'll extract tweets that are labeled as positive and create a word cloud to visually represent the most common words used.**

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

positive_words = " ".join(df[df["sentiment"] == "Positive"]["clean_tweet"])
negative_words = " ".join(df[df["sentiment"] == "Negative"]["clean_tweet"])

wordcloud = WordCloud(width=800, height=400, background_color="white").generate(positive_words)
plt.figure(figsize=(10,5))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.title("Most Frequent Words in Positive Tweets")
plt.show()

> **Now, let's move on to generating the word cloud for negative sentiment.
In this step, we'll extract tweets labeled as negative and create a word cloud that highlights the most common words used.**

In [None]:
wordcloud = WordCloud(width=800, height=400, background_color="white").generate(negative_words)
plt.figure(figsize=(10,5))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.title("Most Frequent Words in Negative Tweets")
plt.show()

> **Note:**
>> **After displaying the results, you'll notice that some words are common across both positive and negative sentiment classes. However, their frequency of occurrence varies between the two classes.**

>  **3. Tweet Length Distribution**
> * Objective: Analyze the average length of tweets and how it relates to sentiment.
> * We'll visualize the result using histograms and a bar chart.

> **First, we need to add a new feature to our dataset: the length of the tweets.**

In [None]:
df["tweet_length"] = df["clean_tweet"].apply(lambda x: len(x.split()))
df.head()

> **Let's filter the data to check if there are any tweets with a length of zero after cleaning them using the `clean_tweet` function. We'll then compare these tweets with their original versions before cleaning.**

> **The results show 7,090 tweets with a length of zero after cleaning, meaning they are empty. Upon examining the original content, we can see that most of these tweets only contained user mentions (e.g., `@user`), which were removed during the cleaning process.**

In [None]:
zero_len = df[df["tweet_length"]==0][["tweet" ,"clean_tweet","tweet_length"]]
display(zero_len.shape)
zero_len.head()

> **Now, we will remove the empty tweets as well as any duplicates from the dataset.**

In [None]:
df = df[df["tweet_length"] != 0].reset_index(drop = True)
df = df.drop_duplicates(subset=['clean_tweet'], keep='first')
display(df.shape)
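
The effect of the two filtering steps is easy to verify on a small frame. This is a sketch with made-up rows; the column names match the notebook's, and the final `reset_index` is an optional addition so the index stays contiguous after dropping duplicates.

```python
import pandas as pd

# Hypothetical frame with one empty tweet and one duplicate.
toy = pd.DataFrame({
    "clean_tweet": ["good day", "", "good day", "bad day"],
    "tweet_length": [2, 0, 2, 2],
})

kept = toy[toy["tweet_length"] != 0]                         # drop empty tweets
kept = kept.drop_duplicates(subset=["clean_tweet"], keep="first")
kept = kept.reset_index(drop=True)                           # renumber rows 0..n-1

print(kept)
```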

> **Now, let's calculate the average tweet length for both positive and negative sentiments.**

In [None]:
tweet_len = df.groupby(["sentiment"]).agg(
   mean = ("tweet_length", "mean")
)

tweet_len = tweet_len.transpose()
tweet_len

> **After displaying the distribution using a histogram and comparing the mean tweet lengths for positive and negative sentiments with a bar chart, it is evident that negative tweets are generally longer than positive ones.**

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(12, 10))


df[df["sentiment"]=="Negative"]["tweet_length"].hist(color="lightcoral", edgecolor="black", ax=axes[0, 0], bins=30)
df[df["sentiment"]=="Positive"]["tweet_length"].hist(color="lightgreen", edgecolor="black", ax=axes[0, 1], bins=30)
tweet_len.plot(kind="bar", color = ("lightcoral", "lightgreen"), ax = axes[1, 0] )

axes[0, 0].set_title("Negative Tweet Length Distribution")
axes[0, 1].set_title("Positive Tweet Length Distribution")
axes[1, 0].set_title("Positive VS Negative")
fig.delaxes(axes[1,1])

plt.tight_layout()
plt.show()

# 4. Data Modeling
**"Data Modeling" is the next phase in our pipeline.
In this stage, we build and train models that can predict sentiment based on the processed tweet text.**

> ####  Here's an overview of what we typically do:
>
> 1. **Feature Extraction:** Convert cleaned tweet text into numerical features using techniques such as `TF-IDF`, `bag-of-words`, or `word embeddings`.
> 2. **Splitting the Data:** Divide the dataset into training and testing subsets to evaluate model performance.
> 3. **Model Selection:** Choose a suitable algorithm (like logistic regression, SVM, or deep learning models) for sentiment classification.
> 4. **Training the Model:** Fit the chosen model on the training data.
> 5. **Evaluation:** Assess model performance on the test set using metrics like accuracy, precision, recall, and F1-score.

> 1. **Feature Extraction:** In this stage, we will use the `Bag-of-Words` technique.
>
>* `Bag-of-Words` is a method for converting text data into numerical features that can be used by machine learning algorithms. The process involves:
>* **Vocabulary Creation:** Building a list of all unique words found in the corpus (in this case, all tweets).
>* **Vectorization:** Representing each tweet as a vector, where each element corresponds to the count (or frequency) of a word from the vocabulary in that tweet.
>* **Simplicity and Efficiency:** This technique ignores the grammar and word order, focusing solely on the occurrence of words, which makes it simple yet effective for many text classification tasks such as sentiment analysis.

![](https://vitalflux.com/wp-content/uploads/2021/08/Bag-of-words-technique-to-convert-to-numerical-feature-vector-png-640x212.png)
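
The vocabulary-creation and vectorization steps can be written out by hand on a two-document corpus. This is an illustrative sketch using only the standard library (the corpus is made up); `CountVectorizer` performs the same idea at scale with sparse matrices.

```python
from collections import Counter

# Hypothetical corpus of two cleaned tweets.
corpus = ["good day good", "bad day"]

# Vocabulary creation: sorted unique tokens across all documents.
vocabulary = sorted({word for doc in corpus for word in doc.split()})

# Vectorization: one count per vocabulary word, per document.
vectors = []
for doc in corpus:
    counts = Counter(doc.split())
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
print(vectors)
```

Each row has one entry per vocabulary word, and word order within the tweet plays no role in the result.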

> **After applying the Bag-of-Words technique, we observed that 373,727 unique words (features) were extracted from the dataset. Note that the code below passes `max_features=100000`, which keeps only the 100,000 most frequent words, so the vocabulary size it reports is capped at 100,000.**

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_features=100000)
X = cv.fit_transform(df["clean_tweet"])
len(cv.get_feature_names_out())

> **Now, we'll store our target variable in `y` after converting the labels to binary: negative becomes 0 and positive becomes 1.**

In [None]:
y = df["sentiment"].map({"Negative": 0, "Positive": 1})

> 2. **Splitting the Data into Training and Test Sets**

In [None]:
from sklearn.model_selection import train_test_split

# test_size is left at its default, which holds out 25% of the rows for testing
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)
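
The resulting split sizes can be checked on a toy array. This sketch spells out `test_size=0.25` and adds `stratify`, which the notebook does not use; it is shown here as an optional way to keep the class proportions identical in both subsets. The toy arrays are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 8 samples, 2 features, balanced binary labels.
X_toy = np.arange(16).reshape(8, 2)
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])

x_tr, x_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy
)

print(x_tr.shape, x_te.shape)
print(y_te)
```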

## Model with LogisticRegression
  

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

scaler = StandardScaler(with_mean=False)
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

model = LogisticRegression(max_iter=500, solver='saga')
model.fit(x_train_scaled, y_train)

In [None]:
y_pred = model.predict(x_test_scaled)
y_pred

## Model Evaluation (LogisticRegression)

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy*100:.4f}%')

conf_matrix = confusion_matrix(y_test, y_pred)
print('Confusion Matrix:')
print(conf_matrix)

class_report = classification_report(y_test, y_pred)
print('Classification Report:')
print(class_report)
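
The figures in the classification report all derive from the four cells of the confusion matrix. This is a worked sketch with hypothetical counts (not the notebook's results), laid out the way `confusion_matrix` returns them: rows are true labels, columns are predicted labels.

```python
# Hypothetical confusion-matrix counts: [[tn, fp], [fn, tp]].
tn, fp = 70, 30
fn, tp = 20, 80

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted positives that are correct
recall = tp / (tp + fn)                      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

These formulas describe the positive class; the report repeats the same calculation with the roles swapped for the negative class.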

In [None]:
plt.figure(figsize=(6, 4))
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", xticklabels=["Negative", "Positive"], yticklabels=["Negative", "Positive"])

plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix Visualization")
plt.show()

> **Now, let's proceed and predict the sentiment of some sentences from our own input.**

In [None]:
sent = pd.DataFrame({"tweet": ["I am very happy today", "lol, i have depression"]})
sent["clean_tweet"] = sent["tweet"].apply(clean_tweet)
sent

> **After displaying the results, we see that the model correctly predicted the sentiment for both sentences.**

In [None]:
pre = cv.transform(sent["clean_tweet"])
pre = scaler.transform(pre)
predict_sent = model.predict(pre)
predict_sent
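
New text has to pass through the same vectorizer and scaler that were fitted on the training data. One way to make that hard to get wrong is to bundle the steps in a scikit-learn pipeline. The following is a sketch under stated assumptions: the four training texts and their labels are made up and already cleaned, the solver is left at its default, and `label_names` is a hypothetical helper for turning 0 and 1 back into readable labels.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical, already-cleaned toy data; 0 = negative, 1 = positive.
texts = ["happi love good", "good day love", "sad bad miss", "bad day hate"]
labels = [1, 1, 0, 0]

# Vectorize, scale (without centering, so the matrix stays sparse), then classify.
pipeline = make_pipeline(
    CountVectorizer(),
    StandardScaler(with_mean=False),
    LogisticRegression(max_iter=500),
)
pipeline.fit(texts, labels)

label_names = {0: "Negative", 1: "Positive"}
predictions = pipeline.predict(["love good", "sad hate"])
print([label_names[p] for p in predictions])
```

With a pipeline, a single `predict` call applies every fitted step in order, so the separate `cv.transform` and `scaler.transform` calls are no longer needed.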