# Natural Language Processing

Project outline courtesy of ChatGPT, code and revisions by xiaoa0

### 1. Install Required Libraries
Ensure that you have the necessary libraries installed. You'll need scikit-learn, numpy, and pandas, which can be installed via pip:

In [37]:
pip install scikit-learn numpy pandas

Note: you may need to restart the kernel to use updated packages.
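To confirm that the installation worked, a quick check like the one below can be run in a fresh cell. It is only a sketch: it prints whatever versions happen to be installed, and no particular version is assumed.

```python
import numpy
import pandas
import sklearn

# Print the installed version of each library used in this notebook.
for module in (sklearn, numpy, pandas):
    print(module.__name__, module.__version__)
```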


### 2. Import Libraries
Import the required libraries in your Python script or Jupyter Notebook:

In [38]:
import re

import numpy as np
import pandas as pd
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2, SelectKBest
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

### 3. Load and Preprocess Data
Load the labeled dataset into a pandas DataFrame and clean the text before analysis. This notebook keeps three columns: "Text" (the review text), "Cat1" (the broader product category), and "Cat2" (a narrower subcategory). The cell below strips punctuation and lowercases the text; stop-word removal and tokenization are handled later by the vectorizer in step 5.

In [39]:
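df_original = pd.read_csv('/Users/xiaoa1/VSCode/ds/_notebooks/data/train_40k.csv')
df_val = pd.read_csv('/Users/xiaoa1/VSCode/ds/_notebooks/data/val_10k.csv')

# Keep the review text and the two category columns, then shuffle the rows
columns = ['Text', 'Cat1', 'Cat2']
df = shuffle(df_original[columns])

# Class distributions (only shown if evaluated as the last line of a cell)
df.Cat1.value_counts()
df.Cat2.value_counts()

# Strip punctuation from the review text
p = re.compile(r'[^\w\s]+')
df['Text'] = [p.sub('', x) for x in df['Text'].tolist()]

# Lowercase every column; assign the result back so the change persists
df = df.apply(lambda x: x.astype(str).str.lower())
df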

|  | Text | Cat1 | Cat2 |
| --- | --- | --- | --- |
| 33490 | i dont know why you would buy anything else un... | pet supplies | dogs |
| 34087 | after trying out about 12 different ones i cam... | beauty | fragrance |
| 571 | i bought these for my 7 year old nephew for ch... | toys games | electronics for kids |
| 20015 | my sons often raid my cabinets for my sushi no... | grocery gourmet food | snack food |
| 4134 | not only is this product actually good for you... | beauty | skin care |
| ... | ... | ... | ... |
| 263 | this puzzle is well worth buying as it is quit... | toys games | puzzles |
| 11302 | this is my first epilator so i dont have much ... | health personal care | personal care |
| 21190 | very cute a little smaller than i expected but... | toys games | baby toddler toys |
| 20772 | works well with or without an ice pack there w... | baby products | feeding |
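The cleaning step can be tried on a small example first. The sketch below uses a made-up two-row DataFrame (not the real CSV) and pandas string methods in place of the list comprehension; the regular expression is the same one used in the cell above. Note that pandas string operations return new objects, so the result has to be assigned back to take effect.

```python
import pandas as pd

# Hypothetical toy reviews standing in for the real dataset.
toy = pd.DataFrame({
    "Text": ["I don't know WHY!", "Works well, with or without ice."],
    "Cat1": ["Pet Supplies", "Baby Products"],
})

# Remove everything that is not a word character or whitespace, then lowercase.
toy["Text"] = toy["Text"].str.replace(r"[^\w\s]+", "", regex=True).str.lower()
toy["Cat1"] = toy["Cat1"].str.lower()

print(toy["Text"].tolist())
# ['i dont know why', 'works well with or without ice']
```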


### 4. Split the Dataset
Split the dataset into training and testing sets with train_test_split from scikit-learn, so that some data is held out for evaluating the model. Here 20% of the rows are reserved for testing, and passing the text and both label columns together keeps them aligned across the split:

In [40]:
x,y,z = df.Text, df.Cat1, df.Cat2
train_x, test_x, train_y, test_y, train_z, test_z = train_test_split(x, y, z, test_size=0.2, random_state=42)
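To see how train_test_split behaves with several arrays at once, here is a minimal sketch on invented data (the ten "review i" strings and the a/b, x/y labels are placeholders). Each input array yields a train part and a test part, and rows stay aligned across all of them.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 texts with two label columns.
texts = [f"review {i}" for i in range(10)]
cat1 = ["a", "b"] * 5
cat2 = ["x", "y"] * 5

# Three input arrays produce six outputs: a train/test pair for each.
tr_x, te_x, tr_y, te_y, tr_z, te_z = train_test_split(
    texts, cat1, cat2, test_size=0.2, random_state=42
)

print(len(tr_x), len(te_x))  # 8 2

# Every training text still lines up with its original labels.
for text, y, z in zip(tr_x, tr_y, tr_z):
    i = int(text.split()[1])
    assert (cat1[i], cat2[i]) == (y, z)
```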

### 5. Feature Extraction
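Convert the text into numerical feature vectors that a machine learning algorithm can use. The pipeline below uses scikit-learn's TfidfVectorizer to turn each review into TF-IDF weights over unigrams and bigrams, keeps the 10,000 features with the highest chi-squared scores via SelectKBest, and passes them to a LinearSVC classifier with an L1 penalty: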

In [41]:
pipeline = Pipeline([('vect', TfidfVectorizer(ngram_range=(1,2), stop_words='english', sublinear_tf=True)),
                     ('chi', SelectKBest(chi2, k=10000)),
                     ('clf', LinearSVC(C=1.0, penalty='l1',max_iter=3000, dual=False))
                    ])

### 6. Model Training
Fit the pipeline on the training data. The classifier here is a linear Support Vector Machine (LinearSVC). Pipeline.fit accepts a single target array, so passing both train_y and train_z raises "TypeError: Pipeline.fit() takes from 2 to 3 positional arguments but 4 were given"; the model below is trained on the Cat1 labels only:

In [42]:
model = pipeline.fit(train_x, train_y)
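Because Pipeline.fit takes one target array, one way to cover both category levels is to fit a separate copy of the pipeline per label column. The sketch below is an assumption about how that could look, not the notebook's own method: it uses a simplified two-step pipeline and four invented reviews, and sklearn.base.clone to get independent unfitted copies.

```python
from sklearn.base import clone
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data with two label columns.
texts = ["dog food bowl", "cat toy mouse", "rose perfume spray", "face cream lotion"]
cat1 = ["pets", "pets", "beauty", "beauty"]
cat2 = ["dogs", "cats", "fragrance", "skin care"]

# Simplified stand-in for the notebook's pipeline (no feature selection step).
base = Pipeline([("vect", TfidfVectorizer()), ("clf", LinearSVC())])

# clone() returns an unfitted copy, so the two models do not share state.
model_cat1 = clone(base).fit(texts, cat1)
model_cat2 = clone(base).fit(texts, cat2)

print(model_cat1.predict(["dog food"]), model_cat2.predict(["dog food"]))
```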

### 7. Model Evaluation
Use the trained model to predict the labels for the test data, and evaluate its performance using accuracy or other suitable metrics:

In [None]:
print('accuracy score: ' + str(model.score(test_x, test_y)))

accuracy score: 0.843375
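Accuracy alone can hide weak classes, so a per-class breakdown is often useful as well. The sketch below uses hand-written true and predicted labels (invented for illustration, not taken from the model above) to show accuracy_score alongside classification_report.

```python
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical labels: 4 of the 5 predictions are correct.
y_true = ["beauty", "beauty", "toys games", "toys games", "pet supplies"]
y_pred = ["beauty", "toys games", "toys games", "toys games", "pet supplies"]

acc = accuracy_score(y_true, y_pred)
print(acc)  # 0.8

# Precision, recall, and F1 for each class.
print(classification_report(y_true, y_pred, zero_division=0))
```

With the real model, the same functions could be called on test_y and the output of model.predict(test_x).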


### 8. Prediction
Once the model is trained and evaluated, you can use it to classify new, unseen text. Pass the raw strings to the model's predict method; the fitted pipeline applies the same TF-IDF vectorization and feature selection before classifying:

In [None]:
print(model.predict(['bright color and good texture']))

['beauty']
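predict also accepts several strings at once, and for a LinearSVC-based pipeline decision_function exposes the per-class scores behind each prediction. The sketch below trains a small stand-in pipeline on six invented reviews so that it runs on its own; the training texts and the second query are assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical training data for a self-contained example.
texts = [
    "soft lipstick color", "bright eyeshadow texture",
    "chew toy for dogs", "dog leash and collar",
    "puzzle for kids", "toy blocks for kids",
]
labels = ["beauty", "beauty", "pet supplies", "pet supplies", "toys games", "toys games"]

toy_model = Pipeline([("vect", TfidfVectorizer()), ("clf", LinearSVC())]).fit(texts, labels)

new_texts = ["bright color and good texture", "a leash for my dog"]
preds = toy_model.predict(new_texts)
scores = toy_model.decision_function(new_texts)  # shape: (n_texts, n_classes)

# The predicted label is the class with the highest score in each row.
for text, pred, row in zip(new_texts, preds, scores):
    print(text, "->", pred, dict(zip(toy_model.classes_, np.round(row, 2))))
```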
