## Connect Kaggle using Kaggle API

In [11]:
# upload kaggle.json
from google.colab import files
files.upload()


Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"sinha0810","key":"<redacted>"}'}

In [12]:
# configure the path of kaggle.json
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json


`!` - runs the rest of the line as a shell command

``!mkdir -p ~/.kaggle``
- Makes the directory (.kaggle) in the home directory
- ``-p``: no error if the directory already exists

``!cp kaggle.json ~/.kaggle/``
- ``cp`` = copy command
- ``kaggle.json`` = file name
- ``~/.kaggle/`` = destination folder

``!chmod 600 ~/.kaggle/kaggle.json``
- ``chmod`` = change permissions
- ``600`` = owner can read/write; no one else can access

> In short, it puts the Kaggle API key in the correct hidden folder and restricts access to it so the Kaggle CLI in Colab can use it.
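
The three shell steps above can also be sketched in plain Python, which is handy outside a notebook. This is a minimal sketch, not part of the original notebook: the function name `install_kaggle_token` and its `home` parameter are hypothetical, introduced only for illustration.

```python
import os
import shutil
import stat
from pathlib import Path


def install_kaggle_token(token_path, home=None):
    """Copy kaggle.json into <home>/.kaggle and restrict it to the owner."""
    home = Path(home) if home is not None else Path.home()
    kaggle_dir = home / ".kaggle"
    kaggle_dir.mkdir(parents=True, exist_ok=True)   # like: mkdir -p ~/.kaggle
    target = kaggle_dir / "kaggle.json"
    shutil.copyfile(token_path, target)             # like: cp kaggle.json ~/.kaggle/
    os.chmod(target, stat.S_IRUSR | stat.S_IWUSR)   # like: chmod 600 (owner read/write)
    return target
```

Calling `install_kaggle_token("kaggle.json")` would place the file under the real home directory; passing `home=` makes it easy to try against a scratch folder first.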

## 1. Import Dataset

In [13]:
# API to fetch dataset
!kaggle datasets download -d kazanova/sentiment140

Dataset URL: https://www.kaggle.com/datasets/kazanova/sentiment140
License(s): other
Downloading sentiment140.zip to /content
  0% 0.00/80.9M [00:00<?, ?B/s]
100% 80.9M/80.9M [00:00<00:00, 1.35GB/s]


In [15]:
# Extract the .zip file to .csv file
from zipfile import ZipFile
dataset = '/content/sentiment140.zip'

with ZipFile(dataset, 'r') as zf:   # 'zf' avoids shadowing the built-in zip()
  zf.extractall()
  print('The dataset is extracted')

The dataset is extracted
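
Before extracting, it can help to look at what the archive contains. The sketch below is an illustration only: `list_archive` is a hypothetical helper, and a tiny in-memory archive stands in for `sentiment140.zip` so the example runs without the download.

```python
import io
from zipfile import ZipFile


def list_archive(archive):
    """Return (name, size_in_bytes) for each member of a zip archive."""
    with ZipFile(archive, "r") as zf:
        return [(info.filename, info.file_size) for info in zf.infolist()]


# Build a tiny in-memory archive so the sketch runs without the real download.
buffer = io.BytesIO()
with ZipFile(buffer, "w") as zf:
    zf.writestr("sample.csv", "0,hello\n4,world\n")

members = list_archive(buffer)
print(members)
```

The same helper should accept a path such as `'/content/sentiment140.zip'`, since `ZipFile` takes either a path or a file-like object.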


## 2. Import Dependencies

In [None]:
import numpy as np
import pandas as pd
import re

from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

import pickle

In [32]:
# download the stopwords
import nltk
nltk.download('stopwords')
# printing the stopwords in English
print(stopwords.words('english'))

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## 3. Data Processing

In [43]:
data = pd.read_csv('/content/training.1600000.processed.noemoticon.csv', encoding='latin1')
data.head()

Unnamed: 0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
0,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
1,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
2,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
3,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
4,0,1467811372,Mon Apr 06 22:20:00 PDT 2009,NO_QUERY,joy_wolf,@Kwesidei not the whole crew


In [44]:
data.shape

(1599999, 6)

The file has no header row, so pandas used the first tweet as the column names. That is why the shape shows 1,599,999 rows instead of 1,600,000.

In [45]:
# naming the column name and reading the dataset again
column_names = ['target', 'id', 'date', 'flag', 'user', 'text']
data = pd.read_csv('/content/training.1600000.processed.noemoticon.csv', encoding='latin1', names=column_names)
data.head()

Unnamed: 0,target,id,date,flag,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


In [46]:
data.shape

(1600000, 6)

In [47]:
data.columns

Index(['target', 'id', 'date', 'flag', 'user', 'text'], dtype='object')

In [48]:
# check for missing values
data.isnull().sum()

Unnamed: 0,0
target,0
id,0
date,0
flag,0
user,0
text,0


In [49]:
# checking distribution of target column
data['target'].value_counts()

Unnamed: 0_level_0,count
target,Unnamed: 1_level_1
0,800000
4,800000


0 - Negative Sentiment


4 - Positive Sentiment

> **Note:** Here the two classes are equally distributed. If the dataset were imbalanced, we would typically upsample the minority class or downsample the majority class so the model does not simply favor the larger class.
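
Downsampling, as mentioned in the note, can be sketched as follows. This is a minimal illustration with made-up toy rows, not something this balanced dataset needs; the helper name `downsample` and the toy data are assumptions.

```python
import random
from collections import Counter


def downsample(rows, label_of, seed=0):
    """Randomly drop rows from larger classes until every class matches the smallest."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(label_of(row), []).append(row)
    smallest = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, smallest))   # sample without replacement
    rng.shuffle(balanced)
    return balanced


# Hypothetical imbalanced toy data: 8 negative rows, 2 positive rows.
rows = [(0, f"neg {i}") for i in range(8)] + [(1, f"pos {i}") for i in range(2)]
balanced = downsample(rows, label_of=lambda row: row[0])
print(Counter(label for label, _ in balanced))
```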

In [50]:
# Convert the target value (where 4 --> 1)
data['target'] = data['target'].replace(4, 1)

In [51]:
# checking distribution of target column
data['target'].value_counts()

Unnamed: 0_level_0,count
target,Unnamed: 1_level_1
0,800000
1,800000


0 - Negative Sentiment

1 - Positive Sentiment

### Stemming
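- Reduces a word to its root (stem) by stripping suffixes
- example: acting, acted, acts = act
- The stem need not be a dictionary word (e.g. "many" becomes "mani" in the output below)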

In [53]:
# perform stemming of the text
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))            # Build the stopword set once; reloading the list for every word is very slow

# create the function
def stemming(content):
  stemmed_content = re.sub('[^a-zA-Z]', ' ', content)   # Replace every non-letter character with a space
  stemmed_content = stemmed_content.lower()             # Lowercase
  stemmed_content = stemmed_content.split()             # Split the content into a list of words
                                                        # Drop stopwords, then stem the remaining words
  stemmed_content = [stemmer.stem(word) for word in stemmed_content if word not in stop_words]
  stemmed_content = ' '.join(stemmed_content)           # Join the words back into one string per tweet
  return stemmed_content


In [65]:
# apply function to the data -> save in new feature
data['stemmed_content'] = data['text'].apply(stemming)

KeyboardInterrupt: 

In [68]:
data.head()

Unnamed: 0,target,id,date,flag,user,text,stemmed_content
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t...",switchfoot http twitpic com zl awww bummer sho...
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...,upset updat facebook text might cri result sch...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...,kenichan dive mani time ball manag save rest g...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire,whole bodi feel itchi like fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all....",nationwideclass behav mad see


In [64]:
data.shape

(1600000, 7)

In [70]:
# Compare original text vs processed text
print(data.iloc[2]['text'])
print(data.iloc[2]['stemmed_content'])

@Kenichan I dived many times for the ball. Managed to save 50%  The rest go out of bounds
kenichan dive mani time ball manag save rest go bound
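
The cleaning steps that run before stemming can be reproduced without NLTK. The sketch below is illustrative only: `clean_tokens` is a hypothetical helper and `STOPWORDS` is a tiny hand-picked subset of the English stopword list, just large enough for this one tweet.

```python
import re

# Tiny illustrative subset of the English stopword list (assumption for this sketch).
STOPWORDS = {"i", "for", "the", "to", "of", "out"}


def clean_tokens(text, stopwords=STOPWORDS):
    """Keep letters only, lowercase, split into words, and drop stopwords."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    return [word for word in letters_only.lower().split() if word not in stopwords]


tweet = "@Kenichan I dived many times for the ball. Managed to save 50%  The rest go out of bounds"
tokens = clean_tokens(tweet)
print(tokens)
# ['kenichan', 'dived', 'many', 'times', 'ball', 'managed', 'save', 'rest', 'go', 'bounds']
```

Comparing this with the processed line above shows what the Porter stemmer adds on top: "dived" becomes "dive", "many" becomes "mani", "managed" becomes "manag", and so on.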


**For sentiment analysis we only need 'stemmed_content' (the feature) and 'target' (0 or 1, the label). The other columns are not used.**

In [72]:
data['target']

Unnamed: 0,target
0,0
1,0
2,0
3,0
4,0
...,...
1599995,1
1599996,1
1599997,1
1599998,1


In [74]:
data['stemmed_content']

Unnamed: 0,stemmed_content
0,switchfoot http twitpic com zl awww bummer sho...
1,upset updat facebook text might cri result sch...
2,kenichan dive mani time ball manag save rest g...
3,whole bodi feel itchi like fire
4,nationwideclass behav mad see
...,...
1599995,woke school best feel ever
1599996,thewdb com cool hear old walt interview http b...
1599997,readi mojo makeov ask detail
1599998,happi th birthday boo alll time tupac amaru sh...


In [108]:
# Separating the data and label
X = data['stemmed_content'].values
Y = data['target'].values

In [109]:
print(X) #first 3 and last 3

['switchfoot http twitpic com zl awww bummer shoulda got david carr third day'
 'upset updat facebook text might cri result school today also blah'
 'kenichan dive mani time ball manag save rest go bound' ...
 'readi mojo makeov ask detail'
 'happi th birthday boo alll time tupac amaru shakur'
 'happi charitytuesday thenspcc sparkschar speakinguph h']


In [110]:
print(Y) # first 3 and last 3

[0 0 0 ... 1 1 1]


## 4. Train, Test split

In [166]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, stratify=Y, random_state=2)
# stratify=Y --> keeps the class proportions of 'Y' the same in the training and test sets
# e.g. if the labels are 60:40 before the split, both parts are also about 60:40
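
What stratification does can be sketched in plain Python. This is not scikit-learn's implementation, only a minimal illustration of the idea; `stratified_split` and the 60:40 toy labels are assumptions made for the example.

```python
import random
from collections import Counter


def stratified_split(labels, test_size=0.2, seed=2):
    """Return (train_idx, test_idx) with each label's share kept in both parts."""
    rng = random.Random(seed)
    by_label = {}
    for idx, label in enumerate(labels):
        by_label.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)                          # shuffle within each class
        n_test = round(len(indices) * test_size)      # same fraction from every class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


labels = [0] * 60 + [1] * 40          # a 60:40 toy label column
train_idx, test_idx = stratified_split(labels)
print(Counter(labels[i] for i in train_idx))   # 48 zeros and 32 ones
print(Counter(labels[i] for i in test_idx))    # 12 zeros and 8 ones
```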

In [167]:
# print all the shape of splits
print(X.shape)
print(X_train.shape)
print(X_test.shape)

(1600000,)
(1280000,)
(320000,)


## 5. Feature Extraction
ML models cannot work with raw text directly; they need numerical data.

- **TF-IDF**
  1. **TF: Term Frequency** --> Number of times a term appears in 'that' document
  2. **IDF: Inverse Document Frequency** --> Higher for words that are rare across all document's'

  > So, TF-IDF considers not only how many times a word appears in the document (1 tweet), but also how rare that word is across all documents (all tweets)
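
The weighting described above can be worked through by hand on a few toy 'tweets'. The sketch below is an illustration under stated assumptions: it uses the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which is my understanding of `TfidfVectorizer`'s default settings, and it splits on whitespace instead of using the library's tokenizer. The function name `tfidf` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """TF-IDF with smoothed IDF and L2 normalisation, one dict per document."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = []
    for tokens in tokenised:
        weights = {t: count * idf[t] for t, count in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


docs = ["happi day", "sad day", "happi"]   # three toy 'tweets'
vectors = tfidf(docs)
print(vectors[1])   # 'sad' is rarer than 'day', so it gets the larger weight
print(vectors[2])   # a one-term document normalises to a weight of 1.0
```

The second print also explains why a row with a single stored term in a TF-IDF matrix carries the value 1.0: after L2 normalisation, one non-zero entry must have length 1.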

In [168]:
# Converting textual data into numerical data

vectorizer = TfidfVectorizer()

# Learn vocabulary and IDF weights from the training data only,
# and convert the training text into TF-IDF features in the same step
X_train = vectorizer.fit_transform(X_train)

# Convert the test text using the vocabulary learned from the training data
X_test = vectorizer.transform(X_test)

In [169]:
import pickle

with open('vectorizer.pkl', 'wb') as f:
    pickle.dump(vectorizer, f)


In [114]:
print(X_train)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 9453092 stored elements and shape (1280000, 461488)>
  Coords	Values
  (0, 436713)	0.27259876264838384
  (0, 354543)	0.3588091611460021
  (0, 185193)	0.5277679060576009
  (0, 109306)	0.3753708587402299
  (0, 235045)	0.41996827700291095
  (0, 443066)	0.4484755317023172
  (1, 160636)	1.0
  (2, 109306)	0.4591176413728317
  (2, 124484)	0.1892155960801415
  (2, 407301)	0.18709338684973031
  (2, 129411)	0.29074192727957143
  (2, 406399)	0.32105459490875526
  (2, 433560)	0.3296595898028565
  (2, 77929)	0.31284080750346344
  (2, 443430)	0.3348599670252845
  (2, 266729)	0.24123230668976975
  (2, 409143)	0.15169282335109835
  (2, 178061)	0.1619010109445149
  (2, 150715)	0.18803850583207948
  (2, 132311)	0.2028971570399794
  (2, 288470)	0.16786949597862733
  (3, 406399)	0.29029991238662284
  (3, 158711)	0.4456939372299574
  (3, 151770)	0.278559647704793
  (3, 56476)	0.5200465453608686
  :	:
  (1279996, 318303)	0.21254698865277744
  (12

## 6. Training the ML Model
- Logistic Regression Model

In [115]:
model = LogisticRegression(max_iter=1000)
model.fit(X_train, Y_train)

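# max_iter = maximum number of solver iterations (check loss, adjust weights, ...) allowed before training stops;
# a higher cap gives the solver room to converge, it does not by itself improve accuracy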
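
To make the role of an iteration cap concrete, here is a one-feature logistic regression trained by plain gradient descent. This is a deliberately simplified stand-in: scikit-learn's `LogisticRegression` uses a different solver and adds regularisation by default, so the sketch only illustrates the "check loss gradient, adjust weights, repeat up to `max_iter` times" loop. The names `fit_logistic`, `lr`, `tol` and the toy data are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, max_iter=1000, lr=0.5, tol=1e-6):
    """Fit p(y=1|x) = sigmoid(w*x + b) by gradient descent on the log loss."""
    w, b = 0.0, 0.0
    for iteration in range(1, max_iter + 1):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(w * x + b) - y     # prediction minus label
            grad_w += error * x
            grad_b += error
        grad_w /= len(xs)
        grad_b /= len(xs)
        w -= lr * grad_w                       # adjust weights against the gradient
        b -= lr * grad_b
        if max(abs(grad_w), abs(grad_b)) < tol:   # converged before reaching the cap
            break
    return w, b, iteration


# Toy, non-separable data: larger x tends to mean label 1.
xs = [-2, -1, -1, 1, 1, 2]
ys = [0, 0, 1, 0, 1, 1]
w, b, n_iter = fit_logistic(xs, ys)
print(n_iter, round(w, 3), round(b, 3))
```

If the loop stops because it hit `max_iter` instead of the tolerance, the weights may not have converged yet, which is the situation a larger cap is meant to avoid.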

In [116]:
# use the pickle library to save this model as trained_model.pkl
with open('trained_model.pkl', 'wb') as file:
    pickle.dump(model, file)


## 7. Model Evaluation

In [117]:
# Accuracy score on the training data
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)   # (y_true, y_pred) order
training_data_accuracy

0.79871953125

In [118]:
# Accuracy score on the test data
X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)   # (y_true, y_pred) order
test_data_accuracy

0.77668125

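Test accuracy = 77.67% (training accuracy = 79.87%)

Training accuracy is only about 2 percentage points higher than test accuracy --> no sign of serious overfitting. Whether the model underfits cannot be judged from this gap alone; that depends on how much better a stronger model could do.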
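
Accuracy itself is simple enough to compute by hand, which makes the two numbers above easier to interpret. A minimal sketch follows; the helper `accuracy` is hypothetical and the four-label example is made up, while the two scores are copied from the outputs above.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


print(accuracy([0, 1, 1, 0], [0, 1, 0, 0]))   # 3 of 4 correct -> 0.75

# Gap between the two scores reported above.
train_acc, test_acc = 0.79871953125, 0.77668125
gap = train_acc - test_acc
print(f"train-test gap: {gap:.4f}")
```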

## 8. Load the saved model for future prediction

In [119]:
# load the model back; the 'with' block closes the file afterwards
with open('trained_model.pkl', 'rb') as file:
    trained_model = pickle.load(file)
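
For later predictions the fitted vectorizer is needed alongside the model, because new text has to be transformed with the same vocabulary. One option is to store both in a single file. The sketch below is an assumption-laden illustration: `save_artifacts`, `load_artifacts`, and the file name are hypothetical, and plain dicts stand in for the fitted `TfidfVectorizer` and `LogisticRegression`.

```python
import pickle
import tempfile
from pathlib import Path


def save_artifacts(path, vectorizer, model):
    """Pickle the vectorizer and model together so they cannot get out of sync."""
    with open(path, "wb") as f:
        pickle.dump({"vectorizer": vectorizer, "model": model}, f)


def load_artifacts(path):
    """Return (vectorizer, model) from a file written by save_artifacts."""
    with open(path, "rb") as f:
        bundle = pickle.load(f)
    return bundle["vectorizer"], bundle["model"]


# Plain dicts stand in for the fitted TfidfVectorizer and LogisticRegression.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sentiment_artifacts.pkl"
    save_artifacts(path, {"vocab": ["happi", "sad"]}, {"weights": [1.5, -1.5]})
    restored_vectorizer, restored_model = load_artifacts(path)

print(restored_vectorizer, restored_model)
```

Pickle files can execute code when loaded, so only load files that come from a trusted source.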

In [153]:
def prediction(features):
  # 'features' is one already vectorized row (e.g. a row of X_test), not raw text
  pred = trained_model.predict(features)

  if pred[0] == 1:
    return "Positive Sentiment"
  else:
    return "Negative Sentiment"

# predict test rows 200 to 205
for i in range(200, 206):
  print(prediction(X_test[i]))


Positive Sentiment
Negative Sentiment
Positive Sentiment
Positive Sentiment
Positive Sentiment
Positive Sentiment


In [143]:
X_test.shape

(320000, 461488)

In [164]:
# predict the sentiment from this model
# First, preprocess the input text
input_text = "this is not done"
stemmed_input = stemming(input_text)

# Then, transform the stemmed text into numerical features using the trained vectorizer
# The vectorizer expects an iterable (e.g., a list) of documents
input_vector = vectorizer.transform([stemmed_input])

# Now, predict the sentiment using the trained model
prediction_result = trained_model.predict(input_vector)

# Print the prediction
if prediction_result[0] == 1:
    print("Positive Sentiment")
else:
    print("Negative Sentiment")

Positive Sentiment
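
The "Positive" result for a negated sentence is worth a closer look. One plausible reason is that 'not' is in the English stopword list printed in section 2, so it is removed before the model ever sees the text. The sketch below illustrates this with a three-word subset of the stopword list; `remove_stopwords` and its `keep` parameter are hypothetical, and keeping negations is only one possible remedy, not something the notebook does.

```python
import re

# Small subset of the English stopword list (assumption for this sketch).
STOPWORDS = {"this", "is", "not"}


def remove_stopwords(text, stopwords=STOPWORDS, keep=()):
    """Lowercase, keep letters only, and drop stopwords except those in 'keep'."""
    active = set(stopwords) - set(keep)
    words = re.sub(r"[^a-zA-Z]", " ", text).lower().split()
    return [word for word in words if word not in active]


print(remove_stopwords("this is not done"))                  # ['done']
print(remove_stopwords("this is done"))                      # ['done']
print(remove_stopwords("this is not done", keep={"not"}))    # ['not', 'done']
```

With the default list, both sentences reduce to the same token, so any model built on that preprocessing has to give them the same prediction.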


----

### For requirements.txt file

In [1]:
import streamlit, numpy, pandas, sklearn, nltk, matplotlib

print("streamlit", streamlit.__version__)
print("numpy", numpy.__version__)
print("pandas", pandas.__version__)
print("scikit-learn", sklearn.__version__)
print("nltk", nltk.__version__)
print("matplotlib", matplotlib.__version__)


streamlit 1.52.2
numpy 2.1.3
pandas 2.2.3
scikit-learn 1.6.1
nltk 3.9.1
matplotlib 3.10.0
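
The versions printed above can be turned into pinned `requirements.txt` lines. This is a small sketch: `format_requirements` is a hypothetical helper, and the versions are copied from the output above. Note that the package is listed as `scikit-learn`, which is its install name, even though it is imported as `sklearn`.

```python
def format_requirements(versions):
    """Turn {package: version} into pinned requirements.txt lines."""
    return "\n".join(f"{name}=={version}" for name, version in versions.items()) + "\n"


# Versions copied from the output above.
versions = {
    "streamlit": "1.52.2",
    "numpy": "2.1.3",
    "pandas": "2.2.3",
    "scikit-learn": "1.6.1",
    "nltk": "3.9.1",
    "matplotlib": "3.10.0",
}
requirements_text = format_requirements(versions)
print(requirements_text)
```

Writing `requirements_text` to a file named `requirements.txt` would complete the step.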
