# Natural Language Processing

This notebook runs with the Python 3.9 kernel and can be downloaded to run, test, and modify locally. It uses data from a [CSV file](https://www.kaggle.com/datasets/kashnitsky/hierarchical-text-classification?select=train_40k.csv) containing 40,000 Amazon reviews. It currently uses the review text to predict the score of a given review, but the target can be changed by replacing the "Score" references with any other column, such as the category columns Cat1, Cat2, or Cat3.

### 1. Installation
Install these libraries before running the rest of the notebook.

In [1]:
pip install scikit-learn numpy pandas

Note: you may need to restart the kernel to use updated packages.


### 2. Import Libraries
This notebook uses numpy, pandas, and scikit-learn for data handling and modeling, plus Python's built-in re module for text cleanup.

In [2]:
import numpy as np
import pandas as pd
import re
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2, SelectKBest
from sklearn.svm import LinearSVC

### 3. Load and Preprocess Data
The dataset is train_40k.csv from [this Kaggle page](https://www.kaggle.com/datasets/kashnitsky/hierarchical-text-classification). Loading it into a pandas DataFrame (df) lets us keep only the relevant columns (review text and rating), shuffle the rows, strip punctuation, and lowercase the text.

In [None]:
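# Load the raw reviews (this path is specific to the author's machine; adjust as needed)
df_original = pd.read_csv('/Users/xiaoa1/VSCode/ds/_notebooks/data/train_40k.csv')

# Keep only the review text and its star rating
columns = ['Text', 'Score']

# Shuffle the rows so the original file order does not matter
df = shuffle(df_original[columns])

# Class balance (only displayed when this is the last line of a cell)
df.Score.value_counts()

# Strip every run of characters that is neither a word character nor whitespace
p = re.compile(r'[^\w\s]+')
df['Text'] = [p.sub('', x) for x in df['Text'].tolist()]

# Lowercase the text and store the result back in the column
df['Text'] = df['Text'].str.lower()

df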

|  | Text | Score |
|---|---|---|
| 21547 | background | 5.0 |
| 6165 | i have also fallen in love with these cookies ... | 5.0 |
| 22004 | i love these they taste great and they give yo... | 5.0 |
| 32478 | this is a great fragance top quality and not v... | 4.0 |
| 31508 | this is a wonderful progressive puzzle from 2 ... | 5.0 |
| ... | ... | ... |
| 7100 | bought one set for my self picked up two more ... | 5.0 |
| 30207 | first off i must tell you that this is a perfe... | 5.0 |
| 27540 | i love doing puzzles as they are very relaxing... | 3.0 |
| 19405 | i read all the reviews here and elsewhere befo... | 4.0 |
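
The cleanup can also be wrapped in a small helper so that the same steps are applied to training data and to any new review. This is a minimal sketch using only the standard library; `clean_text` and `PUNCT` are hypothetical names, not part of the notebook.

```python
import re

# Same pattern as in the cell above: one or more characters that are
# neither word characters (letters, digits, underscore) nor whitespace
PUNCT = re.compile(r'[^\w\s]+')

def clean_text(text):
    """Remove punctuation and lowercase a single review (hypothetical helper)."""
    return PUNCT.sub('', str(text)).lower()

print(clean_text("Great price, but it tasted a bit stale!"))
# great price but it tasted a bit stale
```

Note that apostrophes are removed as well, so "don't" becomes "dont", while underscores are kept because `\w` matches them.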


### 4. Split the Dataset
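scikit-learn's train_test_split holds out 20% of the reviews for testing (test_size=0.2) and keeps the remaining 80% for training. Setting random_state=100 makes the split repeatable for a given row order.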

In [None]:
x,y = df.Text, df.Score
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.2, random_state=100)
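
To see what these arguments do, here is a small sketch on made-up data (the `texts` and `scores` lists are illustrative stand-ins for `df.Text` and `df.Score`, not values from the dataset):

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for df.Text and df.Score (illustrative data only)
texts = [f"review {i}" for i in range(10)]
scores = [5.0, 5.0, 4.0, 5.0, 3.0, 5.0, 1.0, 4.0, 5.0, 2.0]

train_x, test_x, train_y, test_y = train_test_split(
    texts, scores, test_size=0.2, random_state=100
)
print(len(train_x), len(test_x))  # 8 2

# Re-running with the same random_state on the same input gives the same split
again = train_test_split(texts, scores, test_size=0.2, random_state=100)
print(again[1] == test_x)  # True
```

Because the rows are shuffled earlier without a fixed seed, the reviews that land in each set can still differ between notebook runs; passing a `random_state` to `shuffle` as well would be one way to pin that down.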

### 5. Feature Extraction
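Convert the text data into numerical feature vectors that a machine learning algorithm can use. The pipeline chains three steps: scikit-learn's TfidfVectorizer turns each review into TF-IDF weights over unigrams and bigrams (with English stop words removed), SelectKBest keeps the 10,000 features with the highest chi-squared scores against the rating, and LinearSVC with an L1 penalty acts as the classifier: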

In [None]:
pipeline = Pipeline([('vect', TfidfVectorizer(ngram_range=(1,2), stop_words='english', sublinear_tf=True)),
                     ('chi', SelectKBest(chi2, k=10000)),
                     ('clf', LinearSVC(C=1.0, penalty='l1',max_iter=3000, dual=False))
                    ])

### 6. Model Training

In [None]:
model = pipeline.fit(train_x, train_y)

### 7. Model Evaluation
The pipeline's score method returns the mean accuracy on the held-out test set.

In [None]:
print('accuracy score: '+ str(model.score(test_x, test_y)))

accuracy score: 0.64675
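
An accuracy figure is easier to interpret next to a baseline that always predicts the most common rating. The notebook does not show the class counts, so the sketch below uses made-up labels to illustrate the idea with scikit-learn's `DummyClassifier`; in the notebook, `train_y` and `test_y` would take the place of the toy arrays.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels skewed toward 5 stars (illustrative, not the real class counts)
y_train = np.array([5.0, 5.0, 5.0, 4.0, 1.0, 5.0, 3.0, 5.0])
y_test = np.array([5.0, 4.0, 5.0, 5.0, 2.0])

# DummyClassifier ignores the features, so a placeholder column is enough
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

baseline = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
print(baseline.score(X_test, y_test))  # 0.6 on this toy data
```

If the model's accuracy is only slightly above the baseline on the real data, most of its score comes from the class imbalance rather than from the text.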


### 8. Prediction
Now we can use the predict function to input our own written Amazon review and see what the predicted score is!

In [None]:
print(model.predict(['price is good but tasted a bit stale'])) # type your review here to see the predicted rating! exclude punctuation

[4.]
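
Since the training text had its punctuation removed, a new review should go through the same cleanup before prediction. The sketch below wraps both steps in a hypothetical `predict_score` helper. The tiny `toy_model` exists only so the example runs on its own, and it is a simplified stand-in, not the notebook's pipeline; in the notebook, the fitted `model` would be passed in.

```python
import re
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

PUNCT = re.compile(r'[^\w\s]+')

def predict_score(model, review):
    """Clean a raw review like the training text, then predict (hypothetical helper)."""
    cleaned = PUNCT.sub('', review).lower()
    return model.predict([cleaned])[0]

# Tiny stand-in model trained on made-up reviews (illustrative only)
toy_x = ["love these cookies", "great taste", "tasted stale", "terrible and stale"]
toy_y = [5.0, 5.0, 1.0, 1.0]
toy_model = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', LinearSVC(dual=False)),
]).fit(toy_x, toy_y)

print(predict_score(toy_model, "Price is good, but it tasted a bit stale!"))
```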
