<a href="https://colab.research.google.com/github/sakib762/Machine-Learning-Experiment/blob/main/Twitter_Sentiment_ML_08.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# About Dataset
**Context**

This is the Sentiment140 dataset. It contains 1,600,000 tweets extracted using the Twitter API. The tweets have been annotated (0 = negative, 4 = positive) and can be used to detect sentiment.


**Content**

It contains the following 6 fields:

target: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)

ids: the id of the tweet (2087)

date: the date of the tweet (Sat May 16 23:58:44 UTC 2009)

flag: the query (lyx). If there is no query, then this value is NO_QUERY.

user: the user that tweeted (robotickilldozr)

text: the text of the tweet (Lyx is cool)


**Acknowledgements**

The official link regarding the dataset with resources about how it was generated is here
The official paper detailing the approach is here

Citation: Go, A., Bhayani, R. and Huang, L., 2009. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1(2009), p.12.

# Importing Dependencies

In [None]:
#importing dependencies
import re
import string
import warnings

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer

warnings.filterwarnings('ignore') #suppress library warnings in the notebook output

In [None]:
nltk.download('stopwords')

In [None]:
print(stopwords.words('english'))

# Data Collection

In [None]:
#mounting drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
#loading the dataset (the file is latin-1 encoded)
df = pd.read_csv('/content/drive/MyDrive/database/ML Project Database/twitter sentiment.csv', encoding='latin-1')

In [None]:
df

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
#df shape
df.shape

# Data Analysis

In [None]:
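#naming the columns
#the CSV file has no header row, so the first load used the first tweet as column names;
#reloading it with explicit column names keeps that row as data
column_names = ["target","id","date","flag","user","text"]
df = pd.read_csv('/content/drive/MyDrive/database/ML Project Database/twitter sentiment.csv', encoding='latin-1', names=column_names)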
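
The effect of `names=` can be seen on a tiny in-memory CSV. This is only a sketch: the two rows below are made up for illustration and are not taken from the dataset.

```python
import io

import pandas as pd

# Two made-up rows in the same six-field layout, with no header row
raw_csv = (
    "0,1,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,user_a,first made-up tweet\n"
    "4,2,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,user_b,second made-up tweet\n"
)
demo_columns = ["target", "id", "date", "flag", "user", "text"]

# Without names=, pandas treats the first row as the header and loses it as data
no_names = pd.read_csv(io.StringIO(raw_csv))

# With names=, every row is kept as data
with_names = pd.read_csv(io.StringIO(raw_csv), names=demo_columns)

print(no_names.shape)
print(with_names.shape)
```

The first frame has one row fewer than the second, which is why the dataset is loaded again with explicit column names.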

In [None]:
df.head()

In [None]:
#counting the missing values in each column
df.isnull().sum()

In [None]:
#checking the distribution in the target column
df['target'].value_counts()

In [None]:
#changing value 4 to 1
df['target'] = df['target'].replace(4,1)
df["target"].value_counts()

**Here 0 means a negative tweet and 1 means a positive tweet.**

# Stemming

**Stemming is the process of reducing a word to its root form, such as: acting, acted, acts = act**
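
A quick way to see what the Porter stemmer does is to run it on a few words. The sketch below uses a separate stemmer object (`demo_stemmer` is a name chosen here for illustration) so it does not interfere with the notebook's own variables.

```python
from nltk.stem.porter import PorterStemmer

demo_stemmer = PorterStemmer()

# Inflected forms of the same verb collapse to one stem
inflected = ["acting", "acted", "acts"]
stems = [demo_stemmer.stem(word) for word in inflected]
print(stems)

# Porter only strips suffixes by rule, so a related noun can keep its own stem
print(demo_stemmer.stem("actor"))
```

Stemming is rule-based suffix stripping, not dictionary lookup, so related words do not always end up with the same stem.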

In [None]:
port_steam = PorterStemmer() #reducing word to root word

In [None]:
#declaring the stemming function
stop_words = set(stopwords.words('english')) #build the stopword set once; looking it up per word is very slow

def stemming(content):
  stemmed_content = re.sub('[^a-zA-Z]',' ',content) #everything except a-z and A-Z is replaced with a space
  stemmed_content = stemmed_content.lower() #converting everything to lowercase
  stemmed_content = stemmed_content.split() #splitting into words
  stemmed_content = [port_steam.stem(word) for word in stemmed_content if word not in stop_words] #dropping stopwords, stemming the rest
  stemmed_content = ' '.join(stemmed_content)
  return stemmed_content
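
To see what the cleaning part of this function does before any stemming happens, here is a small sketch using only the regular expression step. The helper name `clean_tweet` and the example tweet are made up for illustration.

```python
import re

def clean_tweet(text):
    """Keep letters only, lowercase the text, and split it into tokens."""
    letters_only = re.sub('[^a-zA-Z]', ' ', text)  # every non-letter becomes a space
    return letters_only.lower().split()

# Hypothetical tweet, not taken from the dataset
tokens = clean_tweet("@user I LOVE this!!! http://t.co/abc 100%")
print(tokens)
# ['user', 'i', 'love', 'this', 'http', 't', 'co', 'abc']
```

Note that mentions and URLs are not removed as a whole: their letters survive as separate tokens (`user`, `http`, `t`, `co`, `abc`), while digits and punctuation disappear.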

In [None]:
df['stem_text'] = df['text'].apply(stemming)

In [None]:
df.head()

In [None]:
print(df['stem_text'])

In [None]:
print(df["target"])

# Data Splitting

In [None]:
#separating data and label
x = df['stem_text'].values
y = df['target'].values

In [None]:
#data splitting
#stratify=y keeps the proportion of 0s and 1s the same in the training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, stratify=y, random_state=2)


In [None]:
print(x.shape, x_train.shape, x_test.shape)
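
The effect of `stratify` is easier to see on imbalanced toy labels. The sketch below assumes a made-up label array with 80 negatives and 20 positives; the names `x_demo` and `y_demo` are for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 80 negatives (0) and 20 positives (1)
y_demo = np.array([0] * 80 + [1] * 20)
x_demo = np.arange(len(y_demo))

_, _, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=2
)

# With stratify, both parts keep the original 20% share of positives
print("positive share in train:", y_tr.mean())
print("positive share in test: ", y_te.mean())
```

Without `stratify`, the two shares would only match approximately, depending on the random shuffle.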

# Converting Text Data to Numeric

In [None]:
#converting text data to numeric TF-IDF features
vectorizer = TfidfVectorizer()

x_train = vectorizer.fit_transform(x_train) #learn the vocabulary from the training data only, then transform it
x_test = vectorizer.transform(x_test) #reuse the fitted vocabulary for the test data
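
Why the vectorizer is fitted only on the training data can be shown with a toy corpus. The documents below are made up (already stemmed) and the variable names are for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up, already-stemmed documents
train_docs = ["love thi movi", "hate thi weather", "love sunni weather"]
test_docs = ["love rain", "hate monday"]

demo_vectorizer = TfidfVectorizer()
train_matrix = demo_vectorizer.fit_transform(train_docs)  # learns the vocabulary here
test_matrix = demo_vectorizer.transform(test_docs)        # only reuses that vocabulary

print(sorted(demo_vectorizer.vocabulary_))
print(train_matrix.shape, test_matrix.shape)
```

Both matrices have the same number of columns, one per word seen during fitting. Words that appear only in the test documents (`rain`, `monday`) are ignored, which mirrors what happens with unseen words at prediction time.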

# Training Model

**Logistic Regression**

In [None]:
model = LogisticRegression()

In [None]:
model.fit(x_train, y_train)

# Model Evaluation

In [None]:
#accuracy score on the training data
x_train_prediction = model.predict(x_train)
training_data_accuracy = accuracy_score(y_train, x_train_prediction) #accuracy_score expects (y_true, y_pred)
print("Training Data Accuracy: ", training_data_accuracy)

In [None]:
#accuracy score on the test data
x_test_prediction = model.predict(x_test)
test_data_accuracy = accuracy_score(y_test, x_test_prediction) #accuracy_score expects (y_true, y_pred)
print("Test Data Accuracy: ", test_data_accuracy)

**The training and test accuracies are quite close, which suggests the model is not overfitting and handles unseen tweets about as well as the tweets it was trained on.**
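
Accuracy alone does not show which class the errors fall on. `confusion_matrix` and `classification_report` are already imported at the top of the notebook, so they could be used for a closer look. The sketch below runs them on made-up labels, not on the model's real predictions.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Made-up true labels and predictions, for illustration only
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 1, 1, 1, 0, 0]

demo_accuracy = accuracy_score(y_true_demo, y_pred_demo)
demo_cm = confusion_matrix(y_true_demo, y_pred_demo)  # rows: true class, columns: predicted class

print("Accuracy:", demo_accuracy)
print(demo_cm)
print(classification_report(y_true_demo, y_pred_demo, target_names=["negative", "positive"]))
```

In the notebook, the same calls would take `y_test` and `x_test_prediction` as arguments.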

# Model Saving

In [None]:
import pickle

In [None]:
filename = "trained_model.sav"
with open(filename, 'wb') as f: #the with block closes the file after writing
  pickle.dump(model, f)

# Using Model for Future Prediction

In [None]:
#loading the model
with open(filename, 'rb') as f:
  loaded_model = pickle.load(f)


In [None]:
#picking one already-vectorized tweet from the test set
X_name = x_test[200]
prediction = loaded_model.predict(X_name)
print(prediction)

if prediction[0] == 0:
  print("Negative Tweet")
else:
  print("Positive Tweet")
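
The saved file above contains only the model, so it can score rows that are already vectorized. To predict on new raw text later, the fitted vectorizer is needed as well. One possible approach, sketched here on a made-up toy corpus, is to pickle both objects together. The file name `demo_sentiment_bundle.pkl` and the dictionary keys are assumptions for illustration, and in the real notebook new text would also have to go through the same stemming function first.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up, already-stemmed training texts (1 = positive, 0 = negative)
train_texts = ["love great day", "happi good fun", "hate bad day", "sad terribl aw"]
train_labels = [1, 1, 0, 0]

demo_vectorizer = TfidfVectorizer()
demo_model = LogisticRegression()
demo_model.fit(demo_vectorizer.fit_transform(train_texts), train_labels)

# Save the vectorizer and the model together in one file
bundle_path = os.path.join(tempfile.gettempdir(), "demo_sentiment_bundle.pkl")
with open(bundle_path, "wb") as f:
    pickle.dump({"vectorizer": demo_vectorizer, "model": demo_model}, f)

# Later: load both and predict on new text
with open(bundle_path, "rb") as f:
    bundle = pickle.load(f)

new_text = ["love good day"]
features = bundle["vectorizer"].transform(new_text)  # same vocabulary as in training
demo_prediction = bundle["model"].predict(features)
print("Positive Tweet" if demo_prediction[0] == 1 else "Negative Tweet")
```

Keeping the two objects in one file avoids the situation where a model is loaded together with a vectorizer that was fitted on different data.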