# Natural Language Processing

This notebook runs with the Python 3.9 kernel and can be downloaded to run, test, and modify locally. It uses data from a [CSV file](https://www.kaggle.com/datasets/kashnitsky/hierarchical-text-classification?select=train_40k.csv) containing 40,000 Amazon reviews. It currently uses the review text to predict the score of a given review, but the target can be changed by replacing the "Score" references with any other column, such as the category columns Cat1, Cat2, or Cat3.

### 1. Installation
These libraries need to be installed before the later cells can run.

In [1]:
pip install scikit-learn numpy pandas


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.0.1[0m[39;49m -> [0m[32;49m23.1.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3.10 -m pip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


### 2. Import Libraries
This notebook uses numpy, pandas, and scikit-learn for data analysis, plus Python's built-in re module for text cleanup.

In [2]:
import numpy as np
import pandas as pd
import re
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2, SelectKBest
from sklearn.svm import LinearSVC

### 3. Load and Preprocess Data
The dataset used is train_40k.csv from [this Kaggle page](https://www.kaggle.com/datasets/kashnitsky/hierarchical-text-classification). Loading it into a pandas DataFrame (df) makes it easy to keep only the relevant columns (review text and rating) and then clean the text by removing punctuation and converting it to lowercase.

In [3]:
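# Adjust this path to wherever train_40k.csv is saved locally
df_original = pd.read_csv('/Users/xiaoa1/VSCode/ds/_notebooks/data/train_40k.csv')

# Keep only the review text and its rating, then shuffle the rows
columns = ['Text', 'Score']
df = shuffle(df_original[columns])

# Count reviews per score (only displayed when it is the last expression in a cell)
df.Score.value_counts()

# Strip punctuation: anything that is not a word character or whitespace
p = re.compile(r'[^\w\s]+')
df['Text'] = [p.sub('', x) for x in df['Text'].tolist()]

# Lowercase the text and assign it back; the result is a new column, not an in-place change
df['Text'] = df['Text'].str.lower()

df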

| | Text | Score |
|---|---|---|
| 15121 | this product has a nice tone and really covers... | 5.0 |
| 37624 | i placed my order and recieved exactly what i ... | 4.0 |
| 32824 | these cost 200 from the grocery store no reaso... | 2.0 |
| 4019 | my 3 12 year old daughter received this as a c... | 2.0 |
| 23645 | these are great for potty training little boys... | 5.0 |
| ... | ... | ... |
| 13774 | definitely not worth all the hype i dont think... | 3.0 |
| 22049 | this game is a cute boardgame geared towards p... | 4.0 |
| 6740 | its hard to open even harder to close im afrai... | 1.0 |
| 36539 | i bought this makeup in nyc at henri bendel ab... | 2.0 |
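
The cleanup step can be tried in isolation on a couple of strings. This is a minimal sketch, assuming a hypothetical helper named `clean_text` and made-up sample reviews (neither is part of the notebook); it applies the same regular expression and then lowercases the result.

```python
import re

import pandas as pd

# Same pattern as the notebook: anything that is not a word character or whitespace
PUNCTUATION = re.compile(r"[^\w\s]+")


def clean_text(text: str) -> str:
    """Hypothetical helper: strip punctuation, then lowercase."""
    return PUNCTUATION.sub("", text).lower()


# Made-up sample reviews, not rows from train_40k.csv
sample = pd.DataFrame({
    "Text": ["Great product!!! Works well.", "Didn't fit, sadly."],
    "Score": [5.0, 2.0],
})

# Assign the result back to the column; map returns a new Series
sample["Text"] = sample["Text"].map(clean_text)

print(sample["Text"].tolist())
# ['great product works well', 'didnt fit sadly']
```

A helper like this could also be applied to a hand-written review before it is passed to the model, so that new input is cleaned the same way as the training text.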


### 4. Split the Dataset
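scikit-learn's train_test_split function splits the data into a training set and a test set. Here test_size=0.2 holds out 20% of the reviews for testing, and random_state fixes the shuffle so the split is reproducible. Both values can be adjusted; doing so changes which reviews the model trains on, so the measured accuracy will vary.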

In [4]:
x,y = df.Text, df.Score
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.2, random_state=4000)
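
The effect of the two arguments is easier to see on a small example. The sketch below uses made-up placeholder reviews and scores (assumptions for illustration, not data from the notebook) to show the resulting set sizes and that a fixed `random_state` reproduces the same split.

```python
from sklearn.model_selection import train_test_split

# Made-up placeholder data: ten "reviews" with scores from 1 to 5
texts = [f"review {i}" for i in range(10)]
scores = [1.0, 2.0, 3.0, 4.0, 5.0] * 2

# Two calls with identical arguments
first = train_test_split(texts, scores, test_size=0.2, random_state=4000)
second = train_test_split(texts, scores, test_size=0.2, random_state=4000)

train_x, test_x, train_y, test_y = first
print(len(train_x), len(test_x))  # 8 2 (80% train, 20% test)
print(first == second)            # True: same seed, same split
```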

### 5. Building a Pipeline
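A Pipeline chains the preprocessing steps and the final model into a single estimator, so the same transformations are applied to the training data and to any new input. This one has three steps: TfidfVectorizer turns each review into TF-IDF weights over single words and word pairs, SelectKBest keeps the 10,000 features with the highest chi-squared scores, and LinearSVC is the classifier.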

In [5]:
pipeline = Pipeline([('vect', TfidfVectorizer(ngram_range=(1,2), stop_words='english', sublinear_tf=True)),
                     ('chi', SelectKBest(chi2, k=10000)),
                     ('clf', LinearSVC(C=1.0, penalty='l1',max_iter=3000, dual=False))
                    ])

### 6. Model Training

This fits the pipeline to the training data.

In [6]:
model = pipeline.fit(train_x, train_y)

### 7. Model Evaluation
The pipeline's score function reports the model's accuracy on the held-out test set.

In [7]:
print('accuracy score: '+ str(model.score(test_x, test_y)))

accuracy score: 0.647
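
For a classifier, `score` returns the fraction of test reviews whose predicted rating exactly matches the true rating. The sketch below illustrates that calculation, and a confusion matrix that shows which ratings get mixed up, using made-up label arrays (these are not predictions from the model above).

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true and predicted ratings, for illustration only
y_true = np.array([5.0, 4.0, 2.0, 5.0, 1.0])
y_pred = np.array([5.0, 5.0, 2.0, 5.0, 2.0])

# Accuracy: share of exact matches (3 of 5 here)
acc = accuracy_score(y_true, y_pred)
print(acc)  # 0.6

# Rows are true ratings, columns are predicted ratings, in the order given by labels
labels = [1.0, 2.0, 4.0, 5.0]
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
```

With real predictions, `model.predict(test_x)` would take the place of `y_pred` and `test_y` the place of `y_true`.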


### 8. Prediction
Now we can use the predict function to input our own written Amazon review and see what the predicted score is!

In [8]:
print(model.predict(['price is good but tasted a bit stale'])) # type your review here to see the predicted rating! exclude punctuation

[4.]
