# Task: The goal of this project is to predict whether each review is positive or negative using classification
### Implementation:
### Preprocess Text Data (Remove punctuation, Perform Tokenization, Remove stopwords and Lemmatize/Stem)
### Perform TFIDF Vectorization
### Exploring parameter settings using GridSearchCV on Random Forest & Gradient Boosting Classifier. Use Xgboost instead of Gradient Boosting if it's taking a very long time in GridSearchCV
### Perform Final evaluation of models on the best parameter settings using the evaluation metrics
### Report the best performing model

In [None]:
import warnings
warnings.filterwarnings('ignore')
import os
import re
import string
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
st=stopwords.words('english')
df=pd.read_csv('IMDB_dataset.csv')
df.head()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,review,sentiment
0,I thought this was a wonderful way to spend ti...,positive
1,"Probably my all-time favorite movie, a story o...",positive
2,I sure would like to see a resurrection of a u...,positive
3,"This show was an amazing, fresh & innovative i...",negative
4,Encouraged by the positive comments about this...,negative


## Checking for Null Values

In [4]:
if df.isnull().sum().sum()==0:
    print('No Null Values Found')
    print(df.isna().sum())
else:
    print("Null Values Found In the Dataset")
    print(df.isna().sum())




No Null Values Found
review       0
sentiment    0
dtype: int64


## Preprocess Text Data (Remove punctuation, Perform Tokenization, Remove stopwords and Lemmatize/Stem)

### Create a function to clean data

In [5]:
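Stem=PorterStemmer()
# Cleaning function: strip punctuation, tokenize, drop stopwords, then stem
def Clean_Text(words:str):
    # Remove punctuation character by character
    words="".join([char for char in words if char not in string.punctuation])
    # Split on runs of non-word characters to get tokens
    split=re.split(r"\W+",words)
    # Note: the stopword check is case-sensitive, so capitalized stopwords such as "The" are kept
    return [Stem.stem(word) for word in split if word not in st]
# Percentage of punctuation characters among the non-space characters of the text
def Percetage(words:str):
    count=sum([1 for i in words if i in string.punctuation])
    return round(count/(len(words)-words.count(' ')),3)*100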
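
Because the stopword check in `Clean_Text` is case-sensitive, lowercasing the text before filtering is a common variation. The block below is a minimal sketch of that idea, not the cleaning function used in this notebook: `clean_text_lower` and the tiny `STOPWORDS` set are hypothetical stand-ins (the notebook uses nltk's English stopword list and a Porter stemmer, both omitted here to keep the example dependency-free).

```python
import re
import string

# Stand-in stopword set for illustration only; the notebook uses
# nltk's stopwords.words('english') instead.
STOPWORDS = {"the", "was", "a", "and", "this"}

def clean_text_lower(text: str) -> list:
    """Strip punctuation, lowercase, tokenize, then drop stopwords and empty tokens."""
    no_punct = "".join(ch for ch in text if ch not in string.punctuation)
    tokens = re.split(r"\W+", no_punct.lower())
    return [tok for tok in tokens if tok and tok not in STOPWORDS]

print(clean_text_lower("The movie was GREAT, and the cast was great!"))
# ['movie', 'great', 'cast', 'great']
```

Lowercasing first means "The" and "the" are treated as the same stopword, and "GREAT" and "great" end up as the same token.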

In [6]:
df['sentiment'].unique()

array(['positive', 'negative'], dtype=object)

In [7]:
## Add the length of the text and the percentage of punctuation characters
df['sentiment']=df['sentiment'].map({'positive':1,'negative':0})
df['Length']=df['review'].apply(lambda x:len(x)-x.count(' '))
df['Percentage_']=df['review'].apply(lambda x:Percetage(x))
df.head().sort_values(by='Percentage_',ascending=False)

Unnamed: 0,review,sentiment,Length,Percentage_
4,Encouraged by the positive comments about this...,0,552,5.6
0,I thought this was a wonderful way to spend ti...,1,761,5.3
1,"Probably my all-time favorite movie, a story o...",1,538,5.2
3,"This show was an amazing, fresh & innovative i...",0,761,4.3
2,I sure would like to see a resurrection of a u...,1,577,2.1
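
To make the two extra features concrete, here is a small worked example on a single short string. It is a sketch under assumptions: `punctuation_percentage` is a hypothetical name, it rounds after scaling to a percentage (the notebook's `Percetage` rounds before scaling), and it guards against empty text.

```python
import string

def punctuation_percentage(text: str) -> float:
    """Share of punctuation among non-space characters, as a percentage."""
    non_space = len(text) - text.count(" ")
    if non_space == 0:
        return 0.0  # avoid dividing by zero on empty or all-space text
    punct = sum(1 for ch in text if ch in string.punctuation)
    return round(100 * punct / non_space, 1)

sample = "Wow, what a film!"
print(len(sample) - sample.count(" "))  # Length feature: 14 non-space characters
print(punctuation_percentage(sample))   # 2 of 14 characters are punctuation: 14.3
```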


### Perform TFIDF Vectorization

In [8]:
# Pass the cleaning function to the TF-IDF vectorizer as its analyzer
Tfidf=TfidfVectorizer(analyzer=Clean_Text,max_features=7000)
X_=Tfidf.fit_transform(df['review'])
print(X_.get_shape())
Tfidf_df=pd.concat([df['Length'],df['Percentage_'],pd.DataFrame(X_.toarray())],axis=1)
del X_
Tfidf_df.columns=Tfidf_df.columns.astype(str)
Tfidf_df.head()

(25000, 7000)


Unnamed: 0,Length,Percentage_,0,1,2,3,4,5,6,7,...,6990,6991,6992,6993,6994,6995,6996,6997,6998,6999
0,761,5.3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,538,5.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,577,2.1,0.0,0.0,0.0,0.0,0.10089,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,761,4.3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,552,5.6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
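
Each value in the table above is a TF-IDF weight. The sketch below recomputes such weights by hand on three toy documents, assuming scikit-learn's default settings for `TfidfVectorizer` (raw term counts, smoothed IDF, L2 row normalization). The documents and the helper name `tfidf_row` are made up for illustration.

```python
import math
from collections import Counter

# Toy, already-tokenized documents (illustrative only)
docs = [["good", "film"], ["bad", "film"], ["great", "film", "cast"]]
n_docs = len(docs)

# Document frequency: in how many documents each term appears
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf_row(doc):
    """TF-IDF weights for one document: count * smoothed idf, then L2-normalized."""
    counts = Counter(doc)
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

rows = [tfidf_row(doc) for doc in docs]
for row in rows:
    print({term: round(w, 3) for term, w in sorted(row.items())})
```

A term that appears in every document ("film") receives the lowest weight in each row, while rarer terms are weighted up.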


In [19]:
import sys
print("Data Size: ",round(((sys.getsizeof(Tfidf_df)/1024)/1024)/1024,1),"Gb")

Data Size:  1.3 Gb
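
Most of that 1.3 Gb comes from storing zeros after `toarray()`. One possible alternative, sketched below on toy data, is to keep the TF-IDF matrix sparse and attach the extra numeric column with `scipy.sparse.hstack`. The sample reviews are made up, and the default `TfidfVectorizer` analyzer is used here instead of `Clean_Text`; this is not what the notebook does.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy reviews for illustration only
reviews = [
    "A wonderful film!",
    "Dull, slow and far too long.",
    "Wonderful cast, slow start.",
]

# Extra numeric feature: number of non-space characters per review
lengths = np.array([len(r) - r.count(" ") for r in reviews], dtype=float)

tfidf = TfidfVectorizer()
X_text = tfidf.fit_transform(reviews)  # stays a scipy sparse matrix

# Turn the extra feature into a sparse column and stack it next to the TF-IDF columns
X_extra = sparse.csr_matrix(lengths.reshape(-1, 1))
X_all = sparse.hstack([X_extra, X_text], format="csr")

print(X_all.shape, "stored values:", X_all.nnz)
```

Both `RandomForestClassifier` and `XGBClassifier` accept scipy sparse matrices in `fit`, so a matrix like `X_all` could be passed to `train_test_split` and `GridSearchCV` without converting it to a dense DataFrame.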


## Exploring parameter settings using GridSearchCV on Random Forest & Gradient Boosting Classifier. Use Xgboost instead of Gradient Boosting if it's taking a very long time in GridSearchCV

## GridSearchCV on RandomForestClassifier
### Perform Final evaluation of models on the best parameter settings using the evaluation metrics

In [11]:
X_train, X_test, y_train, y_test = train_test_split(Tfidf_df, df['sentiment'], test_size=0.2, random_state=42)

In [12]:
rf=RandomForestClassifier()
param={'n_estimators': [50, 100],
       'max_depth': [3, 5, None],
       'min_samples_split': [2, 5]}
Grid=GridSearchCV(estimator=rf,param_grid=param,n_jobs=-1,cv=3)
Grid.fit(X_train,y_train)
print("Best_estimator ",Grid.best_estimator_,"best_params_ ",Grid.best_params_,"best_score_ ",Grid.best_score_)
best_Model=Grid.best_estimator_
y_pred=best_Model.predict(X_test)
print("Evaluation Of RandomForestClassifier")
print(classification_report(y_test,y_pred))



Best_estimator  RandomForestClassifier(min_samples_split=5) best_params_  {'max_depth': None, 'min_samples_split': 5, 'n_estimators': 100} best_score_  0.8446500342249613
Evaluation Of RandomForestClassifier
              precision    recall  f1-score   support

           0       0.85      0.84      0.84      2527
           1       0.84      0.85      0.84      2473

    accuracy                           0.84      5000
   macro avg       0.84      0.84      0.84      5000
weighted avg       0.84      0.84      0.84      5000
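
In this notebook the TF-IDF vocabulary is fitted on the full dataset before the train/test split. A common alternative is to put the vectorizer and the classifier in a single scikit-learn `Pipeline`, so that each cross-validation fold fits the vectorizer only on its own training part. The block below is a minimal sketch of that setup on made-up toy reviews, with a smaller grid than the one above; it is not the configuration that produced the reported scores.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data for illustration only: 6 positive (1) and 6 negative (0) reviews
texts = [
    "wonderful film with a great cast", "loved every minute of it",
    "a fresh and amazing story", "great acting and a great script",
    "one of my favorite movies", "beautiful and moving",
    "dull and far too long", "terrible acting and a weak plot",
    "a boring waste of time", "the script was awful",
    "I hated the ending", "slow, flat and forgettable",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),                     # refitted inside every CV fold
    ("rf", RandomForestClassifier(random_state=42)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {"rf__n_estimators": [10, 20], "rf__max_depth": [3, None]}

search = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)
print(search.best_params_, round(search.best_score_, 3))
```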



## GridSearchCV on XGBClassifier
### Perform Final evaluation of models on the best parameter settings using the evaluation metrics

In [15]:
from xgboost import XGBClassifier
xgb = XGBClassifier(use_label_encoder=False, eval_metric='logloss')

param_grid = {
    'n_estimators': [50, 100],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5,None],
    'min_child_weight': [1, 3, 5],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}
grid_search = GridSearchCV(estimator=xgb, param_grid=param_grid,
                           scoring='accuracy', cv=3, n_jobs=-1, verbose=1)
grid_search.fit(X_train, y_train)
print("Best_estimator ",grid_search.best_estimator_,"best_params_ ",grid_search.best_params_,"best_score_ ",grid_search.best_score_)

Best_Model=grid_search.best_estimator_
y_pred=Best_Model.predict(X_test)
print("Evaluation Of XGBClassifier")
print(classification_report(y_test,y_pred))

Fitting 3 folds for each of 288 candidates, totalling 864 fits
Best_estimator  XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=1.0, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric='logloss',
              feature_types=None, gamma=None, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=0.2, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=9,
              max_leaves=None, min_child_weight=5, missing=nan,
              monotone_constraints=None, multi_strategy=None, n_estimators=200,
              n_jobs=None, num_parallel_tree=None, random_state=None, ...) best_params_  {'colsample_bytree': 1.0, 'learning_rate': 0.2, 'max_depth': 9, 'min_child_weight': 5, 'n_estimators': 200, 'subsample': 0.8} best_score_  0.861800204319
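
The "candidates" figure in the log is the product of the number of values per parameter, and the number of fits is that product times the number of folds. The sketch below counts both for the grid as written in the cell above. For that grid the count is 216 candidates, while the log reports 288 and best parameters (`max_depth=9`, `n_estimators=200`) that are not in it, so the logged run presumably used a larger grid; that is an inference, not something the notebook states.

```python
from math import prod

# The grid exactly as written in the XGBClassifier cell
param_grid = {
    'n_estimators': [50, 100],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, None],
    'min_child_weight': [1, 3, 5],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
}
cv_folds = 3

# One candidate per combination of parameter values
n_candidates = prod(len(values) for values in param_grid.values())
n_fits = n_candidates * cv_folds

print(n_candidates, n_fits)  # 216 648
```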

In [None]:
import joblib
joblib.dump(Best_Model,'XGBClassifier.pkl')
joblib.dump(best_Model,'RandomForestClassifier.pkl')

['/content/XGBClassifier.pkl']

## Report the best performing model

The XGBClassifier is the better-performing model. Here's a summary of the key metrics for both models:

### RandomForestClassifier
* Precision: 0.85 (0) / 0.84 (1)
* Recall: 0.84 (0) / 0.85 (1)
* F1-Score: 0.84 (both classes)
* Accuracy: 0.84

### XGBClassifier
* Precision: 0.88 (0) / 0.85 (1)
* Recall: 0.85 (0) / 0.88 (1)
* F1-Score: 0.86 (both classes)
* Accuracy: 0.86

### Conclusion
* Best Model: XGBClassifier
* Reason: It has higher precision for class 0 (0.88 vs. 0.85) and better recall for class 1 (0.88 vs. 0.85), while staying balanced across both classes, resulting in a higher overall F1-score (0.86 vs. 0.84) and accuracy (0.86 vs. 0.84).
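
As a quick consistency check, the F1-scores in the summary can be recomputed from the reported precision and recall, since F1 is their harmonic mean. The helper name `f1_from` is hypothetical, and because the inputs are already rounded to two decimals the recomputed values are approximate.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) per model and class, as reported in the summary above
reported = {
    ("RandomForestClassifier", 0): (0.85, 0.84),
    ("RandomForestClassifier", 1): (0.84, 0.85),
    ("XGBClassifier", 0): (0.88, 0.85),
    ("XGBClassifier", 1): (0.85, 0.88),
}

for (model, cls), (p, r) in reported.items():
    print(model, "class", cls, "F1 =", round(f1_from(p, r), 2))
```

Both classes come out at 0.84 for the RandomForestClassifier and 0.86 for the XGBClassifier, matching the summary.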