# Natural Language Processing

This notebook runs with the Python 3.9 kernel and can be downloaded to run, test, and modify locally. It uses data from a [CSV file](https://www.kaggle.com/datasets/kashnitsky/hierarchical-text-classification?select=train_40k.csv) containing 40,000 Amazon reviews. It currently predicts the category (`Cat1`) of a review from its text, but this can be changed by replacing the `Cat1` references with any other column, such as `Score` or the other category columns `Cat2` and `Cat3`.

### 1. Installation
These libraries need to be installed on the machine before they can be imported in the cells below.

In [6]:
pip install scikit-learn numpy pandas

Note: you may need to restart the kernel to use updated packages.


### 2. Import Libraries
This notebook uses NumPy, pandas, and scikit-learn for data analysis, plus the standard-library `re` module for text cleaning.

In [7]:
import numpy as np
import pandas as pd
import re
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2, SelectKBest
from sklearn.svm import LinearSVC

### 3. Load and Preprocess Data
The dataset used is train_40k.csv from [this Kaggle page](https://www.kaggle.com/datasets/kashnitsky/hierarchical-text-classification). Loading it into a pandas DataFrame (`df`) lets us keep only the relevant columns (review text and category), shuffle the rows, and clean the text by removing punctuation and lowercasing it.

In [8]:
df_original = pd.read_csv('/Users/xiaoa1/VSCode/ds/_notebooks/data/train_40k.csv')

columns = ['Text', 'Cat1']

df = shuffle(df_original[columns])

df.Cat1.value_counts()

p = re.compile(r'[^\w\s]+')

df['Text'] = [p.sub('', x) for x in df['Text'].tolist()]

# apply returns a new DataFrame, so assign it back to keep the lowercased text
df = df.apply(lambda x: x.astype(str).str.lower())
df

| | Text | Cat1 |
|---|---|---|
| 14813 | although this item is not cheap it has worked ... | health personal care |
| 31868 | this was my all time favorite for a number of ... | beauty |
| 25302 | this product came in a timely fashion is a gre... | pet supplies |
| 31767 | these taste graet for a health bar of course y... | health personal care |
| 25006 | my doctor recommened 2 brands for my osteoarth... | health personal care |
| ... | ... | ... |
| 14397 | i love this white noise machine i was used to ... | health personal care |
| 24667 | my vet told me about this as a nonantibioticno... | beauty |
| 20984 | i am giving this as a gift and when it arrived... | toys games |
| 301 | i think this is one of the nicest in the city ... | toys games |
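
To see what the cleaning step does to a single review, here is a minimal sketch using the same regular expression on two made-up strings. The helper name `clean_review` and the sample sentences are assumptions for illustration and are not part of the notebook or the dataset.

```python
import re

# Same pattern as the notebook: runs of characters that are neither
# word characters (letters, digits, underscore) nor whitespace.
punctuation = re.compile(r'[^\w\s]+')


def clean_review(text):
    """Hypothetical helper: strip punctuation, then lowercase."""
    return punctuation.sub('', str(text)).lower()


# Made-up sample reviews, not taken from train_40k.csv.
samples = ["Great product!!! Works well, 10/10.", "Didn't like it..."]
cleaned = [clean_review(s) for s in samples]
print(cleaned)
```

Because punctuation is deleted rather than replaced with a space, tokens separated only by punctuation are joined: "10/10" becomes "1010" and "Didn't" becomes "didnt". If that is undesirable, substituting a space instead of an empty string is one possible alternative.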


### 4. Split the Dataset
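scikit-learn's `train_test_split` function splits the data into training and test sets. Here `test_size=0.2` holds out 20% of the reviews for testing, and `random_state` fixes the shuffle so the split is reproducible. Changing either value changes which reviews the model trains on, so the reported accuracy will vary slightly.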

In [9]:
x,y = df.Text, df.Cat1
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.2, random_state=4000)
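
If some categories are much rarer than others, a stratified split keeps the class proportions the same in both sets. The sketch below is an optional variation, not what the notebook does: it passes `stratify` to `train_test_split` on made-up, imbalanced toy data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up toy data: 15 reviews of one category and 5 of another.
texts = pd.Series([f'review {i}' for i in range(20)])
labels = pd.Series(['health personal care'] * 15 + ['pet supplies'] * 5)

# stratify=labels asks for the same 3:1 class ratio in both splits.
train_x, test_x, train_y, test_y = train_test_split(
    texts, labels, test_size=0.2, random_state=4000, stratify=labels
)

print(len(train_x), len(test_x))
print(test_y.value_counts().to_dict())
```

In the notebook itself, the equivalent change would be adding `stratify=y` to the existing call.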

### 5. Building a Pipeline
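A pipeline chains preprocessing and modeling steps into a single estimator, so the same transformations are applied in the same order during training and prediction. Here the text is converted to TF-IDF features over unigrams and bigrams, the 10,000 features with the highest chi-squared scores are kept, and a linear support vector classifier with an L1 penalty is trained on them. Because datasets can be very large, it's beneficial to send the data through a pipeline: each step can be swapped or tuned to accommodate variations in the data and improve accuracy.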

In [10]:
pipeline = Pipeline([('vect', TfidfVectorizer(ngram_range=(1,2), stop_words='english', sublinear_tf=True)),
                     ('chi', SelectKBest(chi2, k=10000)),
                     ('clf', LinearSVC(C=1.0, penalty='l1',max_iter=3000, dual=False))
                    ])
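
To make the stages concrete, the sketch below fits the same three steps on a tiny made-up corpus and reports how many features the vectorizer produces and how many survive selection. The corpus, the name `toy_pipeline`, and `k=5` (instead of 10000, which would exceed the toy vocabulary) are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up toy corpus, not taken from the dataset.
texts = [
    'soft shampoo leaves hair shiny',
    'gentle shampoo for dry hair',
    'conditioner makes hair smooth',
    'dog loves this chew toy',
    'sturdy leash for a large dog',
    'cat food my cat enjoys',
]
labels = ['beauty'] * 3 + ['pet supplies'] * 3

toy_pipeline = Pipeline([
    # Unigrams and bigrams, English stop words removed, log-scaled term frequency.
    ('vect', TfidfVectorizer(ngram_range=(1, 2), stop_words='english', sublinear_tf=True)),
    # Keep the k features with the highest chi-squared score against the labels.
    ('chi', SelectKBest(chi2, k=5)),
    # Linear SVM with an L1 penalty, which requires dual=False.
    ('clf', LinearSVC(C=1.0, penalty='l1', max_iter=3000, dual=False)),
])
toy_pipeline.fit(texts, labels)

n_vocab = len(toy_pipeline.named_steps['vect'].vocabulary_)
n_kept = int(toy_pipeline.named_steps['chi'].get_support().sum())
print(f'features after TF-IDF: {n_vocab}, after chi2 selection: {n_kept}')
```

`named_steps` gives access to each fitted stage by the name used when building the pipeline, which is handy for inspecting the vocabulary or the selected features.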

### 6. Model Training

This fits the pipeline to the training data.

In [11]:
model = pipeline.fit(train_x, train_y)

### 7. Model Evaluation
The pipeline's `score` method reports the model's accuracy on the held-out test set.

In [12]:
print('accuracy score: '+ str(model.score(test_x, test_y)))

accuracy score: 0.843
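
A single accuracy number can hide categories the model handles poorly. As a possible follow-up, scikit-learn's `classification_report` breaks the result down into precision, recall, and F1 per category. The sketch below uses made-up true and predicted labels so it runs on its own; in the notebook the inputs would be `test_y` and `model.predict(test_x)`.

```python
from sklearn.metrics import accuracy_score, classification_report

# Made-up labels for illustration only.
y_true = ['beauty', 'beauty', 'beauty', 'pet supplies', 'pet supplies', 'toys games']
y_pred = ['beauty', 'beauty', 'pet supplies', 'pet supplies', 'pet supplies', 'beauty']

# Fraction of predictions that match the true label (4 of 6 here).
accuracy = accuracy_score(y_true, y_pred)
print(f'accuracy score: {accuracy:.3f}')

# zero_division=0 reports 0 instead of warning when a category is never predicted.
print(classification_report(y_true, y_pred, zero_division=0))
```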


### 8. Prediction
Now we can use the `predict` function on our own written Amazon review and see which category the model predicts!

In [13]:
print(model.predict(['price is good but tasted a bit stale'])) # type your review here to see the predicted category! exclude punctuation

['grocery gourmet food']
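
Rather than removing punctuation by hand, a new review can be cleaned with the same regular expression used on the training text, which keeps its tokens consistent with what the model saw. The sketch below also ranks every category by its `decision_function` score to show how strongly the model prefers its answer. It trains a small stand-in model on made-up data so that it runs on its own; `toy_model` and the sample texts are assumptions, and with the notebook's fitted `model` only the last few lines would be needed.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up training data standing in for the real reviews.
texts = [
    'crunchy granola bar tasted fresh', 'coffee beans tasted stale',
    'shampoo leaves hair shiny', 'lotion for dry skin',
    'dog chew toy', 'cat litter box',
]
labels = ['grocery gourmet food'] * 2 + ['beauty'] * 2 + ['pet supplies'] * 2

toy_model = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', LinearSVC(dual=False)),
]).fit(texts, labels)

# Clean the raw review the same way the training text was cleaned.
punctuation = re.compile(r'[^\w\s]+')
review = 'Price is good, but tasted a bit stale!'
cleaned = punctuation.sub('', review).lower()

# One score per category; the highest score is the predicted category.
scores = toy_model.decision_function([cleaned])[0]
ranked = sorted(zip(toy_model.classes_, scores), key=lambda pair: pair[1], reverse=True)
for category, score in ranked:
    print(f'{category}: {score:.3f}')
```

The scores are signed distances from each category's decision boundary, not probabilities, so they are best read as a ranking.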
