# Analysing Samsung Internal SSD Reviews


![download.jpg](attachment:download.jpg)

# Problem statement

We were given a dataset of customer reviews for Samsung internal SSDs. The goal is to find out whether customers are happy with the product by running sentiment analysis on their reviews.<br><br><br>

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
#sklearn
from sklearn.model_selection import KFold, train_test_split, cross_val_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn import metrics, svm
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier



# To ignore unwanted warnings
import warnings
warnings.filterwarnings('ignore')

In [1]:
df = pd.read_csv('../input/ssd-reviews/ssd_reviews.csv', index_col=0)

# Data Exploration

In [1]:
df.head()

In [1]:
df.info()

#### Number of null values in each column

The heatmap below shows that the 'overall_review' column has many null values, while the 'pros' and 'cons' columns have almost none. The feature engineering section explains how we can take advantage of that.

In [1]:
fig, ax = plt.subplots(figsize = (14, 10))
sns.heatmap(df.isnull(), yticklabels=False, ax = ax, cbar=False, cmap='viridis')
ax.set_title('Customers Reviews')
plt.show()
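
The heatmap shows where the gaps are; a per-column count gives the exact numbers. A minimal sketch on a toy frame (the column names match the ones discussed above, but the values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the reviews data; the values are made up.
toy = pd.DataFrame({
    "overall_review": ["Great drive", np.nan, np.nan, np.nan],
    "pros": ["fast", "cheap", "reliable", "quiet"],
    "cons": ["none", "price", np.nan, "runs warm"],
})

# Number and share of missing values in each column.
null_counts = toy.isnull().sum()
null_share = toy.isnull().mean().round(2)
print(pd.DataFrame({"nulls": null_counts, "share": null_share}))
```

On the real data the same two lines can be applied to `df` directly.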

In [1]:
df['date'] = pd.to_datetime(df['date'])

In [1]:
df['rating_stars'].value_counts()

**The product looks like it is doing very well: the large majority of customers gave it 5 stars. Normally this would be a problem, because a sentiment analysis model trained on such data becomes biased towards the good reviews, which make up more than 80% of our data.<br><br>
Fortunately, a simple trick fixes this problem and roughly doubles our data at the same time. It is described in the feature engineering section.**
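
To put a number on the imbalance, the class shares can be computed with `value_counts(normalize=True)`. The sketch below uses a hypothetical list of star ratings, not the real column, so the printed figures are only illustrative:

```python
import pandas as pd

# Hypothetical star ratings; the real ones live in df['rating_stars'].
ratings = pd.Series([5, 5, 5, 5, 4, 5, 1, 5, 5, 3], name="rating_stars")

# normalize=True returns fractions instead of raw counts.
share = ratings.value_counts(normalize=True).sort_index(ascending=False)
print(share)

# A model that always predicts the most common rating already reaches this accuracy.
majority_baseline = share.max()
print(f"majority-class baseline accuracy: {majority_baseline:.2f}")
```

The majority-class share is a useful floor: any model trained on imbalanced labels has to beat it to be worth anything.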

# Feature Engineering

#### Explaining the trick

As we saw earlier, the 'overall_review' column is missing a lot of data, so we cannot use it and expect a good result.<br>
Instead, we will combine the 'pros' and 'cons' columns into a single column. Concretely, we create a new dataframe with a column called "pros_and_cons" and a second column called "positive" that holds 0 or 1: the value is 1 if the row's text comes from 'pros' and 0 if it comes from 'cons'.<br>

How does this double our data? The 'cons' rows are placed underneath the 'pros' rows in the new "pros_and_cons" column, so it holds 4000 rows instead of 2000 (before null and placeholder rows are dropped).
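
The idea can be sketched on toy data as follows. This is an illustration only: the helper `stack_labeled` is a hypothetical name, and it uses `pd.concat` rather than the `pd.merge` call used in the notebook's own cells.

```python
import numpy as np
import pandas as pd

# Invented rows standing in for the real 'pros' and 'cons' columns.
toy = pd.DataFrame({
    "pros": ["very fast", "easy install", "good price"],
    "cons": ["a bit pricey", np.nan, "no screws included"],
})

def stack_labeled(frame, column, label):
    """Return the non-null texts of `column` with a constant 0/1 'positive' label."""
    out = frame[[column]].dropna().rename(columns={column: "pros_and_cons"})
    out["positive"] = label
    return out

# Stack the labelled 'cons' rows underneath the labelled 'pros' rows.
stacked = pd.concat(
    [stack_labeled(toy, "pros", 1), stack_labeled(toy, "cons", 0)],
    ignore_index=True,
)
print(stacked)
```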

#### Merging 'pros' and 'cons' into one column in a new dataframe

Prepare cons before merging

In [1]:
df_cons = df[['cons']].dropna()
df_cons['positive'] = 0

Below we drop every row whose text is 'none', 'none so far' or 'non'. These are not real cons: a customer who writes 'none' in the 'cons' field means that the product has no cons.

In [1]:
# This drops a lot of rows: around 700 in 'cons'
df_cons.drop(df_cons[df_cons['cons'].isin(['none', 'none so far', 'non'])].index, inplace=True)
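
The exact-match filter above only catches lowercase entries without surrounding whitespace. A hedged variant normalizes the text first; the sample rows and the extra `'n/a'` placeholder are assumptions for illustration, not values taken from the dataset:

```python
import pandas as pd

# Invented examples of what customers might type into the 'cons' field.
cons = pd.DataFrame({"cons": ["none", "None ", "NONE SO FAR", "runs hot", "non", "N/A"]})

# Strings that mean "no cons"; 'n/a' is an assumed addition.
placeholders = {"none", "none so far", "non", "n/a"}

# Compare on a stripped, lowercased copy so the original text is left untouched.
normalized = cons["cons"].str.strip().str.lower()
real_cons = cons[~normalized.isin(placeholders)].reset_index(drop=True)
print(real_cons)
```

Whether this is worth doing depends on how many such variants the real column contains, which `df['cons'].str.lower().value_counts()` would reveal.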

In [1]:
df_cons.rename(columns={'cons':'pros_and_cons'}, inplace=True)

Prepare pros before merging

In [1]:
# To keep the data balanced between 'pros' and 'cons', we take only the first 1562 'pros' rows, because many rows were dropped from 'cons'
df_pros = df[['pros']][:1562].dropna()

In [1]:
df_pros.rename(columns={'pros':'pros_and_cons'}, inplace=True)

In [1]:
df_pros['positive'] = 1

Merging 'pros' and 'cons'

In [1]:
merged_df = pd.merge(left=df_pros, right=df_cons, left_on=['pros_and_cons', 'positive'],
                     right_on=['pros_and_cons', 'positive'], how='outer')

And here is our final data before modeling

In [1]:
merged_df

We can see below that our data is now balanced

In [1]:
merged_df['positive'].value_counts()

# Modeling

In [1]:
X = merged_df.drop('positive', axis=1)
y = merged_df['positive']

In [1]:
bow_f = CountVectorizer(stop_words='english').fit(X['pros_and_cons'])

In [1]:
print("After eliminating stop words: ", len(bow_f.get_feature_names_out()))

In [1]:
bow_transform = bow_f.transform(X['pros_and_cons'])

In [1]:
count_vect_df = pd.DataFrame(bow_transform.todense(), columns=bow_f.get_feature_names_out())
# Total count of each word across all reviews, top 20
count_vect_df.sum().sort_values(ascending=False).head(20)

In [1]:
X_train, X_test, y_train, y_test = train_test_split(bow_transform, y, test_size=0.3, random_state=42)

In [1]:
# Baseline: logistic regression on the bag-of-words counts
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
print('train score' , logreg.score(X_train, y_train))
print('test score' , logreg.score(X_test, y_test))
y_pred = logreg.predict(X_test)

In [1]:
confusion_matrix(y_test, y_pred)
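
For reading the matrices in this notebook: scikit-learn puts the true classes in rows and the predicted classes in columns, so for 0/1 labels `ravel()` yields `tn, fp, fn, tp`. A small sketch with made-up labels (not the notebook's predictions):

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels (1 = positive, 0 = negative).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_hat = [1, 1, 0, 0, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # how many predicted positives were really positive
recall = tp / (tp + fn)     # how many real positives were found
print(f"tn={tn} fp={fp} fn={fn} tp={tp}")
print(f"precision={precision:.2f} recall={recall:.2f}")
```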

## TF-IDF
Let's see if TF-IDF improves the accuracy.

In [1]:
tfidf_vectoriser = TfidfVectorizer(stop_words='english')
tfidf_f = tfidf_vectoriser.fit(X['pros_and_cons'])
tfidf_transform = tfidf_f.transform(X['pros_and_cons'])

In [1]:
# Fix the seed so the split (and the reassigned y_train / y_test) is reproducible
tf_X_train, tf_X_test, y_train, y_test = train_test_split(tfidf_transform, y, test_size=0.3, random_state=42)

In [1]:
tf_logreg = LogisticRegression(C=3.2)
tf_logreg.fit(tf_X_train, y_train)
print('train score' , tf_logreg.score(tf_X_train, y_train))
print('test score' , tf_logreg.score(tf_X_test, y_test))

y_pred = tf_logreg.predict(tf_X_test)

In [1]:
cv=KFold(n_splits=5, shuffle=True, random_state=1)
cross_val_score(tf_logreg, tfidf_transform, y, cv=cv).mean()
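
Note that the vectorizer above is fitted on all rows before splitting, so vocabulary and IDF weights from the held-out rows are visible during training. One common way to avoid this is to wrap the vectorizer and the classifier in a `Pipeline`, so that every fold refits both. The sketch below runs on a handful of invented texts standing in for `merged_df`; it is not the notebook's result:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Invented examples standing in for the 'pros_and_cons' and 'positive' columns.
texts = [
    "very fast", "easy to install", "great price", "reliable and quick",
    "fast boot times", "good value", "too expensive", "runs hot",
    "no cable included", "slow writes when full", "poor software", "pricey",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# The vectorizer is refitted inside each fold, on that fold's training rows only.
pipe = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression(C=3.2))
cv = KFold(n_splits=3, shuffle=True, random_state=1)
scores = cross_val_score(pipe, texts, labels, cv=cv)
print(scores.mean())
```

With the real data, `texts` would be `X['pros_and_cons']` and `labels` would be `y`; scores from such a pipeline are usually a little lower but more honest.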

In [1]:
confusion_matrix(y_test, y_pred)

In [1]:
knn_classifier = KNeighborsClassifier()  
knn_classifier.fit(tf_X_train, y_train)
print('Train score :', knn_classifier.score(tf_X_train, y_train))
print('Test score :', knn_classifier.score(tf_X_test, y_test))
y_pred = knn_classifier.predict(tf_X_test)

In [1]:
confusion_matrix(y_test, y_pred)

In [1]:
tree = DecisionTreeClassifier()
tree.fit(tf_X_train, y_train)
print('train score', tree.score(tf_X_train, y_train))
print('test score', tree.score(tf_X_test, y_test))
y_pred = tree.predict(tf_X_test)

In [1]:
confusion_matrix(y_test, y_pred)

In [1]:
svm_linear = svm.SVC(kernel='linear')
svm_linear.fit(tf_X_train, y_train)
print('Train : ', svm_linear.score(tf_X_train, y_train))
print('Test: ', svm_linear.score(tf_X_test, y_test))
y_pred = svm_linear.predict(tf_X_test)

In [1]:
confusion_matrix(y_test, y_pred)

In [1]:
cv=KFold(n_splits=5, shuffle=True, random_state=1)
cross_val_score(svm_linear, tfidf_transform, y, cv=cv).mean()

##### <span style="color:blue">The Support Vector Machine (SVM) result is the best</span>

In [1]:
svm_rbf = svm.SVC(kernel='rbf',C=1, probability=True)
svm_rbf.fit(tf_X_train, y_train)
print('Train : ', svm_rbf.score(tf_X_train, y_train))
print('Test: ', svm_rbf.score(tf_X_test, y_test))
y_pred = svm_rbf.predict(tf_X_test)

In [1]:
cv=KFold(n_splits=5, shuffle=True, random_state=1)
cross_val_score(svm_rbf, tfidf_transform, y, cv=cv).mean()

In [1]:
confusion_matrix(y_test, y_pred)

In [1]:
svm_poly = svm.SVC(kernel='poly', C=.7)
svm_poly.fit(tf_X_train, y_train)
print('Train : ', svm_poly.score(tf_X_train, y_train))
print('Test: ', svm_poly.score(tf_X_test, y_test))
y_pred = svm_poly.predict(tf_X_test)

In [1]:
cv=KFold(n_splits=5, shuffle=True, random_state=1)
cross_val_score(svm_poly, tfidf_transform, y, cv=cv).mean()

In [1]:
confusion_matrix(y_test, y_pred)

In [1]:
randomF = RandomForestClassifier()
randomF.fit(tf_X_train, y_train)
print('Train score :',randomF.score(tf_X_train, y_train))
print('Test score :',randomF.score(tf_X_test, y_test))
y_pred = randomF.predict(tf_X_test)

In [1]:
cv=KFold(n_splits=5, shuffle=True, random_state=1)
cross_val_score(randomF, tfidf_transform, y, cv=cv).mean()

In [1]:
confusion_matrix(y_test, y_pred)

In [1]:
gnb = GaussianNB(var_smoothing=0.11) 
gnb.fit(tf_X_train.toarray(), y_train) 
print('Train score :',gnb.score(tf_X_train.toarray(), y_train))
print('Test score :',gnb.score(tf_X_test.toarray(), y_test))
y_pred = gnb.predict(tf_X_test.toarray())

In [1]:
confusion_matrix(y_test, y_pred)

In [1]:
cv=KFold(n_splits=5, shuffle=True, random_state=1)
cross_val_score(gnb, tfidf_transform.toarray(), y, cv=cv).mean()

In [1]:
mnb = MultinomialNB(alpha=0.22) 
mnb.fit(tf_X_train.toarray(), y_train) 
print('Train score :',mnb.score(tf_X_train.toarray(), y_train))
print('Test score :',mnb.score(tf_X_test.toarray(), y_test))
y_pred = mnb.predict(tf_X_test.toarray())

In [1]:
confusion_matrix(y_test, y_pred)

<b>Test manually</b>

In [1]:
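tfidf_comments = tfidf_f.transform(['expensive'])
# Predicted class (1 = positive, 0 = negative); printed because only the last expression of a cell is displayed
print(svm_rbf.predict(tfidf_comments))
# Probability of the positive class
round(svm_rbf.predict_proba(tfidf_comments)[0][1], 5)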

# Conclusion
After analysing the Samsung internal SSD customer reviews, building a sentiment analysis model and testing it with cross-validation, we got the following:<br>
- Our model can classify a new SSD review with 88% accuracy

<br>
Finally, after testing many models, the one we chose as the best for this case is the Support Vector Machine (SVM) with the 'rbf' kernel.

I hope you enjoyed reading this notebook. Have a great day!