## Section III Text Classifiers in Python

#### 1. **Scikit-Learn**

##### **(1) Background**
- Open source machine learning library
- More programmatic interface (compared to Weka)
##### **(2) Application**
- **Using Sklearn's *Na&#239;ve Bayes Classifier***:

    ```python
    from sklearn import naive_bayes, metrics # Import the Naive Bayes models and the evaluation metrics
    clfrNB = naive_bayes.MultinomialNB() # Use the multinomial Naive Bayes model
    clfrNB_bnl = naive_bayes.BernoulliNB() # Alternative: the Bernoulli Naive Bayes model
    clfrNB.fit(train_data, train_labels) # Train the classifier using features and labels from the training data

    # instantiating and fitting can be merged into one step:
    clfrNB = naive_bayes.MultinomialNB().fit(train_data, train_labels)

    predicted_labels = clfrNB.predict(test_data) # Use the fitted model to predict the labels for the test dataset

    metrics.f1_score(test_labels, predicted_labels, average="micro") # Use the micro-averaged F1 score to compare the predicted and actual test labels
    ```
- **Using Sklearn's *SVM Classifier***:
    ```python
    from sklearn import svm # Import SVM Model
    clfrSVM = svm.SVC(kernel="linear", C=0.1) # Use a linear kernel and a regularization parameter C of 0.1 for the SVM classifier
    clfrSVM.fit(train_data, train_labels) # Train the classifier using features and labels from the training data
    predicted_labels = clfrSVM.predict(test_data) # Use the fitted SVM to predict the labels for the test dataset
    ```
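The snippets above assume that `train_data` and `test_data` are already numeric feature matrices. A minimal end-to-end sketch, using a made-up six-review corpus and `CountVectorizer` for the features (both are illustrative assumptions, not part of the notes above), might look like this:

```python
from sklearn import metrics, naive_bayes, svm
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus; labels: 1 = positive, 0 = negative
train_texts = ["great phone", "love this phone", "excellent battery",
               "terrible phone", "awful battery", "worst purchase"]
train_labels = [1, 1, 1, 0, 0, 0]
test_texts = ["great battery", "awful phone"]
test_labels = [1, 0]

# Turn raw text into word-count features; the vocabulary is learned from the training text only
vectorizer = CountVectorizer()
train_data = vectorizer.fit_transform(train_texts)
test_data = vectorizer.transform(test_texts)

# Fit both classifiers on the same features and score them with micro-averaged F1
scores = {}
for name, clf in [("NB", naive_bayes.MultinomialNB()),
                  ("SVM", svm.SVC(kernel="linear", C=0.1))]:
    clf.fit(train_data, train_labels)
    predicted_labels = clf.predict(test_data)
    scores[name] = metrics.f1_score(test_labels, predicted_labels, average="micro")

print(scores)
```

Both classifiers share the same `fit`/`predict` interface, which is why they can be swapped inside one loop.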

##### **(3) Model selection**
- *Train-test split*
    ```python
    from sklearn import model_selection
    X_train, X_test, y_train, y_test = model_selection.train_test_split(train_data, train_labels, test_size = 0.333, random_state = 0) # Use a 2/3-1/3 split for training and test sets of the total dataset
    ```
    - **Way to perform**: Split the labeled data by a preset fraction into one group for <u>training the model</u> (training data) and one held-out group for <u>tuning and evaluating the model</u> (test data)
    - **Advantage**: easy to perform
    - **Disadvantage**: the fraction held out as test data cannot be used for training
- *Cross validation*
    ```python
    predicted_labels = model_selection.cross_val_predict(clfrSVM, train_data, train_labels, cv=5) # perform a 5-fold cross validation on the training data
    ```
    - **Way to perform**: Split the training data into k folds, hold one fold out for testing, train on the remaining k-1 folds, and repeat k times so that every fold is held out exactly once
    - **Advantage**: Gives a more reliable, lower-variance estimate of performance; every sample is used for training in some round
    - **Disadvantage**: More complex; more time is needed
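The two strategies above can be sketched side by side. The corpus below is invented for illustration, and wrapping the vectorizer and the SVM in a `Pipeline` is an assumption of this sketch (it keeps the vocabulary from being learned on held-out folds), not something the notes prescribe:

```python
from sklearn import model_selection, svm
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 10 positive and 10 negative reviews
texts = (["great phone number %d" % i for i in range(10)]
         + ["awful phone number %d" % i for i in range(10)])
labels = [1] * 10 + [0] * 10

# Train-test split: roughly 2/3 for training, 1/3 held out
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    texts, labels, test_size=0.333, random_state=0)

# Vectorizer and classifier are refit together inside every fold
pipeline = make_pipeline(CountVectorizer(), svm.SVC(kernel="linear", C=0.1))

# 5-fold cross validation: each sample is predicted by a model that never saw it
cv_predictions = model_selection.cross_val_predict(pipeline, texts, labels, cv=5)
cv_accuracy = accuracy_score(labels, cv_predictions)

print(len(X_train), len(X_test), cv_accuracy)
```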
  

#### 2. **NLTK**

##### **(1) Availability**
- Na&#239;ve Bayes Classifier
- Decision Tree Classifier
- Conditional Exponential Classifier
- Maxent Classifier
- Weka Classifier
- Sklearn Classifier

##### **(2) Application**
- **Using NLTK's *Na&#239;ve Bayes Classifier***:

    ```python
    from nltk.classify import NaiveBayesClassifier # Import Naive Bayes Model
    classifier = NaiveBayesClassifier.train(train_set) # Train NLTK's Naive Bayes classifier on the labeled training set
    classifier.classify(unlabeled_instance) # Use fitted model to predict the label for ONE test sample
    classifier.classify_many(unlabeled_instances) # Use fitted model to predict the label for A SET OF test samples

    from nltk.classify import util # Import essential utilities
    util.accuracy(classifier, test_set) # Get the accuracy of the classifier on the labeled test set
    classifier.labels() # Shows all the labels the classifier has trained on
    classifier.show_most_informative_features() # Shows the most important features for the task
    ```
- **Using NLTK's *Sklearn Classifier* for *SVM Classifier***:
    ```python
    from nltk.classify import SklearnClassifier # Import Sklearn's classifier Model
    from sklearn.svm import SVC
    clfrSVM = SklearnClassifier(SVC(kernel="linear")).train(train_set) # Wrap sklearn's SVM in NLTK's SklearnClassifier and train it
    ```
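NLTK classifiers expect each sample as a `(feature_dict, label)` pair rather than a numeric matrix. The sketch below shows that format on a made-up corpus; the helper `word_features` and the data are hypothetical and only serve to make the calls above runnable:

```python
from nltk.classify import NaiveBayesClassifier, SklearnClassifier, util
from sklearn.svm import SVC

def word_features(text):
    """Hypothetical helper: map a text to NLTK's {feature_name: value} format."""
    return {word: True for word in text.lower().split()}

train_set = [(word_features(text), label) for text, label in [
    ("great phone", "pos"), ("love it", "pos"), ("excellent battery", "pos"),
    ("terrible phone", "neg"), ("awful battery", "neg"), ("worst purchase", "neg")]]
test_set = [(word_features("great battery"), "pos"),
            (word_features("awful phone"), "neg")]

# NLTK's own Naive Bayes implementation
nb = NaiveBayesClassifier.train(train_set)
nb_accuracy = util.accuracy(nb, test_set)

# The same feature dictionaries passed through NLTK's wrapper around sklearn's SVM
svm_clf = SklearnClassifier(SVC(kernel="linear")).train(train_set)
svm_predictions = svm_clf.classify_many([features for features, _ in test_set])

print(sorted(nb.labels()), nb_accuracy, svm_predictions)
```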


#### ***\*Take Home Concepts 3***

$\qquad$ - <u>Scikit-learn</u> is the most commonly used ML toolkit in Python  
$\qquad$ - <u>NLTK</u> has its own Na&#239;ve Bayes implementation  
$\qquad$ - <u>NLTK</u> can also interface with Scikit-learn and Weka

#### 3. **Case Study: *Sentiment Analysis***

- The correct results should be:
    - Percentage of "positively rated" instances: 0.74718
    - Number of instances in train set X_train: 23052
    - Number of features in vect: 19601
    - AUC of logistic regression: 0.89743
    - Number of features after TfidfVectorizer is fit: 5442
    - AUC of logistic regression after TfidfVectorizer: 0.88995
    - Number of features after n-grams: 29072
    - AUC of logistic regression on n-grams: 0.91107

##### **(1) Data Preparation**

```python
import pandas as pd
import numpy as np

df = pd.read_csv('assets/Amazon_Unlocked_Mobile.zip',compression='zip')
df.dropna(inplace=True)
df = df[df['Rating'] != 3] # assume these are neutral
df['Positively Rated'] = np.where(df['Rating']>3, 1, 0) # Return '1' if rating>3, else return '0'

# df = df.sample(frac=0.1, random_state=10)

df.head(10)
```

| Unnamed: 0 | Product Name | Brand Name | Price | Rating | Reviews | Review Votes | Positively Rated |
|---|---|---|---|---|---|---|---|
| 0 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 199.99 | 5 | I feel so LUCKY to have found this used (phone... | 1.0 | 1 |
| 1 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 199.99 | 4 | nice phone, nice up grade from my pantach revu... | 0.0 | 1 |
| 2 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 199.99 | 5 | Very pleased | 0.0 | 1 |
| 3 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 199.99 | 4 | It works good but it goes slow sometimes but i... | 0.0 | 1 |
| 4 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 199.99 | 4 | Great phone to replace my lost phone. The only... | 0.0 | 1 |
| 5 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 199.99 | 1 | I already had a phone with problems... I know ... | 1.0 | 0 |
| 6 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 199.99 | 2 | The charging port was loose. I got that solder... | 0.0 | 0 |
| 7 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 199.99 | 2 | Phone looks good but wouldn't stay charged, ha... | 0.0 | 0 |
| 8 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 199.99 | 5 | I originally was using the Samsung S2 Galaxy f... | 0.0 | 1 |
| 11 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 199.99 | 5 | This is a great product it came after two days... | 0.0 | 1 |


```python
np.average(df['Positively Rated'])
```

```
0.7482686025879323
```

```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['Reviews'],df['Positively Rated'], random_state = 0)
```

```python
print('X_train first entry:\n\n', X_train[0])
print('\n\nX_train shape:', X_train.shape)
print('\n\nX_train first entry\'s label:',y_train[0])
```

```
X_train first entry:

 I feel so LUCKY to have found this used (phone to us & not used hard at all), phone on line from someone who upgraded and sold this one. My Son liked his old one that finally fell apart after 2.5+ years and didn't want an upgrade!! Thank you Seller, we really appreciate it & your honesty re: said used phone.I recommend this seller very highly & would but from them again!!


X_train shape: (231207,)


X_train first entry's label: 1
```
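The labeling rule used in the data preparation (drop 3-star reviews, then mark ratings above 3 as positive) is easy to check on a tiny frame. The five rows below are invented for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical five-review frame standing in for the Amazon data
toy = pd.DataFrame({"Rating": [5, 4, 3, 2, 1],
                    "Reviews": ["love it", "good", "okay", "poor", "broken"]})

toy = toy[toy["Rating"] != 3].copy()                          # drop the neutral 3-star review
toy["Positively Rated"] = np.where(toy["Rating"] > 3, 1, 0)   # 1 if rating > 3, else 0

labels = toy["Positively Rated"].tolist()        # [1, 1, 0, 0]
positive_share = toy["Positively Rated"].mean()  # 0.5
print(labels, positive_share)
```

The mean of the 0/1 column is the share of positive reviews, which is the quantity `np.average(df['Positively Rated'])` reports for the full dataset.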


##### **(2) CountVectorizer**

```python
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer().fit(X_train)
```

```python
vect.get_feature_names_out()[::2000]
```

```
array(['00', '4less', 'adr6275', 'assignment', 'blazingly', 'cassettes',
       'condishion', 'debi', 'dollarsshipping', 'esteem', 'flashy',
       'gorila', 'human', 'irullu', 'like', 'microsaudered',
       'nightmarish', 'p770', 'poori', 'quirky', 'responseive', 'send',
       'sos', 'synch', 'trace', 'utiles', 'withstanding'], dtype=object)
```

```python
len(vect.get_feature_names_out())
```

```
53216
```

```python
X_train_vectorized = vect.transform(X_train)
X_train_vectorized
```

```
<231207x53216 sparse matrix of type '<class 'numpy.int64'>'
	with 6117776 stored elements in Compressed Sparse Row format>
```

```python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train_vectorized, y_train)
```

```
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
```

```python
from sklearn.metrics import roc_auc_score

predictions = model.predict(vect.transform(X_test))

print("AUC: ", roc_auc_score(y_test, predictions))
```

```
AUC:  0.9197254713325582
```
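Two details of the cells above deserve a hedged note. The `STOP: TOTAL NO. of ITERATIONS REACHED LIMIT` message means the solver stopped at its iteration cap, and `roc_auc_score` is given hard 0/1 predictions, which reduces the ROC curve to a single operating point. The sketch below, on a made-up corpus, raises `max_iter` (the value 1000 is an arbitrary assumption) and computes the AUC both from hard labels and from `predict_proba` scores. It illustrates the two calls and is not a rerun of the notebook:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Hypothetical toy data; 1 = positive, 0 = negative
train_texts = ["great phone", "love this phone", "excellent battery", "works perfectly",
               "terrible phone", "awful battery", "worst purchase", "stopped working"]
train_y = [1, 1, 1, 1, 0, 0, 0, 0]
test_texts = ["great battery", "awful phone", "love it", "worst battery"]
test_y = [1, 0, 1, 0]

toy_vect = CountVectorizer().fit(train_texts)
toy_model = LogisticRegression(max_iter=1000)  # higher cap than the default, chosen arbitrarily
toy_model.fit(toy_vect.transform(train_texts), train_y)

X_test_toy = toy_vect.transform(test_texts)
auc_from_labels = roc_auc_score(test_y, toy_model.predict(X_test_toy))
auc_from_scores = roc_auc_score(test_y, toy_model.predict_proba(X_test_toy)[:, 1])

print(auc_from_labels, auc_from_scores)
```

Passing probabilities (or `decision_function` values) lets the metric rank all test samples, so the two numbers can differ on real data.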


```python
feature_names = vect.get_feature_names_out()

sorted_coef_index = model.coef_[0].argsort()

print('Top 10 smallest coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Top 10 largest coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))
```

```
Top 10 smallest coefs:
['worst' 'garbage' 'junk' 'unusable' 'false' 'worthless' 'useless'
 'crashing' 'disappointing' 'awful']

Top 10 largest coefs: 
['excelent' 'excelente' 'exelente' 'loving' 'loves' 'perfecto' 'excellent'
 'awesome' 'complaints' 'buen']
```
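The slicing idiom used to pull out the extreme coefficients is compact, so here is the same pattern on five invented values. `argsort` returns indices in ascending order of the coefficient; `[:k]` takes the k most negative and `[:-(k+1):-1]` walks backwards from the end to take the k most positive:

```python
import numpy as np

names = np.array(["ok", "worst", "great", "meh", "love"])  # hypothetical feature names
coefs = np.array([0.5, -2.0, 1.5, -0.1, 3.0])              # hypothetical coefficients

order = coefs.argsort()                   # indices from most negative to most positive
smallest = names[order[:2]].tolist()      # ['worst', 'meh']
largest = names[order[:-3:-1]].tolist()   # ['love', 'great']

print(smallest, largest)
```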


##### **(3) TFIDF**

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vect2 = TfidfVectorizer(min_df =5).fit(X_train)
```

```python
len(vect2.get_feature_names_out())
```

```
17951
```

```python
X_train_vectorized2 = vect2.transform(X_train)
X_train_vectorized2
```

```
<231207x17951 sparse matrix of type '<class 'numpy.float64'>'
	with 6056695 stored elements in Compressed Sparse Row format>
```

```python
model2 = LogisticRegression()
model2.fit(X_train_vectorized2, y_train)

predictions2 = model2.predict(vect2.transform(X_test))

print('AUC: ',roc_auc_score(y_test, predictions2))
```

```
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


AUC:  0.9265848398605042
```


```python
feature_names2 = vect2.get_feature_names_out()

sorted_tfidf_index2 = X_train_vectorized2.max(0).toarray()[0].argsort()

print('Top 10 smallest TFIDFs:\n{}\n'.format(feature_names2[sorted_tfidf_index2[:10]]))
print('Top 10 largest TFIDFs: \n{}'.format(feature_names2[sorted_tfidf_index2[:-11:-1]]))
```

```
Top 10 smallest TFIDFs:
['commenter' 'pthalo' 'warmness' 'storageso' 'aggregration' '1300'
 '625nits' 'a10' 'submarket' 'brawns']

Top 10 largest TFIDFs: 
['defective' 'batteries' 'gooood' 'epic' 'luis' 'goood' 'basico'
 'aceptable' 'problems' 'excellant']
```


```python
sorted_coef_index2 = model2.coef_[0].argsort()

print('Top 10 smallest coefs:\n{}\n'.format(feature_names2[sorted_coef_index2[:10]]))
print('Top 10 largest coefs: \n{}'.format(feature_names2[sorted_coef_index2[:-11:-1]]))
```

```
Top 10 smallest coefs:
['not' 'worst' 'useless' 'disappointed' 'terrible' 'return' 'waste' 'poor'
 'horrible' 'doesn']

Top 10 largest coefs: 
['love' 'great' 'excellent' 'perfect' 'amazing' 'awesome' 'perfectly'
 'easy' 'best' 'loves']
```


```python
print(model.predict(vect.transform(['not an issue, phone is working','an issue, phone is not working'])))
```

```
[0 0]
```


##### **(4) n-grams**

```python
vect3 = CountVectorizer(min_df=5, ngram_range=(1,2)).fit(X_train)
X_train_vectorized3 = vect3.transform(X_train)

len(vect3.get_feature_names_out())
```

```
198917
```

```python
model3 = LogisticRegression()
model3.fit(X_train_vectorized3, y_train)

predictions3 = model3.predict(vect3.transform(X_test))

print("AUC: ", roc_auc_score(y_test, predictions3))
```

```
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


AUC:  0.9609307598776071
```


```python
feature_names3 = vect3.get_feature_names_out()

sorted_coef_index3 = model3.coef_[0].argsort()

print('Top 10 smallest coefs:\n{}\n'.format(feature_names3[sorted_coef_index3[:10]]))
print('Top 10 largest coefs: \n{}'.format(feature_names3[sorted_coef_index3[:-11:-1]]))
```

```
Top 10 smallest coefs:
['no good' 'not happy' 'not worth' 'worst' 'junk' 'not satisfied'
 'garbage' 'not good' 'terrible' 'defective']

Top 10 largest coefs: 
['excelent' 'excelente' 'not bad' 'excellent' 'exelente' 'perfect'
 'awesome' 'no problems' 'no issues' 'perfecto']
```


```python
print(model3.predict(vect3.transform(['not an issue, phone is working','an issue, phone is not working'])))
```

```
[1 0]
```
