<a href="https://colab.research.google.com/github/wrobbins0409/cse30124-project/blob/main/introToAI_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Intro to AI Project - Classical Sentiment Analysis vs Machine Learning ###



In [1]:
# install necessary packages
!pip install transformers
!pip install datasets



In [None]:
# Imports

import torch
from datasets import load_dataset
from torch.utils.data import Dataset, DataLoader
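from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
# transformers' own AdamW is deprecated in favor of the PyTorch implementation
from torch.optim import AdamW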
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tokenizers import ByteLevelBPETokenizer
import pandas as pd
import re
from pathlib import Path

### Preprocessing and Loading Data ###

Here we load and prepare both the training and testing data. For each split, we read the CSV into a pandas DataFrame, encode the sentiment column as integer labels, cast the text column to strings, and then lowercase the text and remove any special characters other than basic punctuation.

In [None]:
# Normalize the text: lowercase it and strip special characters
def preprocess_text(text):
    # Lowercase so that casing differences do not create distinct tokens
    text = text.lower()
    # Keep letters, digits, whitespace, and basic punctuation; drop everything else
    text = re.sub(r'[^a-zA-Z\s\d.,;!?\'"]', '', text)

    return text

# Load and preprocess the training dataset
train_url = 'https://raw.githubusercontent.com/wrobbins0409/cse30124-project/main/data/sentiment_data/train.csv'
train_df = pd.read_csv(train_url)
le = LabelEncoder()
train_df['label'] = le.fit_transform(train_df['sentiment'])

# apply preprocessing to text
train_df['text'] = train_df['text'].astype(str)
train_df['text'] = train_df['text'].apply(preprocess_text)

# get rid of selected_text because we will not be using it for training
train_df = train_df.drop('selected_text', axis = 1)

# Load and preprocess the testing dataset
test_url = 'https://raw.githubusercontent.com/wrobbins0409/cse30124-project/main/data/sentiment_data/test.csv'
test_df = pd.read_csv(test_url)
# Reuse the encoder fitted on the training labels so both splits share one label mapping
test_df['label'] = le.transform(test_df['sentiment'])

# apply preprocessing once again
test_df['text'] = test_df['text'].astype(str)
test_df['text'] = test_df['text'].apply(preprocess_text)

display(train_df)
display(test_df)

Unnamed: 0,textID,text,sentiment,label
0,cb774db0d1,"id have responded, if i were going",neutral,1
1,549e992a42,sooo sad i will miss you here in san diego!!!,negative,0
2,088c60f138,my boss is bullying me...,negative,0
3,9642c003ef,what interview! leave me alone,negative,0
4,358bd9e861,"sons of , why couldnt they put them on the re...",negative,0
...,...,...,...,...
27476,4eac33d1c0,wish we could come see u on denver husband l...,negative,0
27477,4f4c4fc327,ive wondered about rake to. the client has m...,negative,0
27478,f67aae2310,yay good for both of you. enjoy the break yo...,positive,2
27479,ed167662a5,but it was worth it .,positive,2


Unnamed: 0,textID,text,sentiment,label
0,f87dea47db,last session of the day httptwitpic.com67ezh,neutral,1
1,96d74cb729,shanghai is also really exciting precisely s...,positive,2
2,eee518ae67,"recession hit veronique branquinho, she has to...",negative,0
3,01082688c6,happy bday!,positive,2
4,33987a8ee5,httptwitpic.com4w75p i like it!!,positive,2
...,...,...,...,...
3529,e5f0e6ef4b,"its at 3 am, im very tired but i cant sleep b...",negative,0
3530,416863ce47,all alone in this old house again. thanks for...,positive,2
3531,6332da480c,i know what you mean. my little dog is sinkin...,negative,0
3532,df1baec676,sutra what is your next youtube video gonna be...,positive,2
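
The effect of the cleaning rule and the label encoding shown in the tables above can be checked on a few strings. This is a minimal sketch using only the standard library: the sample inputs are hypothetical (the raw forms are assumed, not taken from the CSV), and the label mapping is a plain-Python stand-in that mirrors how `LabelEncoder` numbers classes in sorted order.

```python
import re

def preprocess_text(text):
    # Same rule as the notebook's function: lowercase, then drop every
    # character that is not a letter, digit, whitespace, or basic punctuation.
    text = text.lower()
    return re.sub(r'[^a-zA-Z\s\d.,;!?\'"]', '', text)

# Hypothetical raw inputs (assumed for illustration, not read from the dataset).
samples = [
    "Last session of the day http://twitpic.com/67ezh",
    "I`d have responded, if I were going",
    "Happy #bday @you!",
]
cleaned = [preprocess_text(s) for s in samples]
for before, after in zip(samples, cleaned):
    print(f"{before!r} -> {after!r}")

# Stand-in for LabelEncoder: classes are sorted, then numbered from 0.
sentiments = ["neutral", "negative", "positive", "negative"]
classes = sorted(set(sentiments))
label_of = {name: idx for idx, name in enumerate(classes)}
print(label_of)
```

Note that the rule also strips `:` and `/`, so URLs collapse into a single token such as `httptwitpic.com67ezh`, and backticks used as apostrophes disappear. Whether that is desirable depends on how much signal links and contractions carry for the task.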


### Tokenizer ###

We now need to prepare a tokenizer for the task, which in this case is simply a tokenizer tailored to the English language.

First, we load a large corpus of English text for the tokenizer to train on.
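
To give a sense of what "training" a byte-pair-encoding tokenizer means, here is a toy sketch of the merge-learning loop. It is an illustration only, not the `tokenizers` library's implementation: it works on characters rather than bytes, ignores `vocab_size`, `min_frequency`, and special tokens, and the function name `learn_bpe_merges` and the tiny corpus are hypothetical.

```python
from collections import Counter

def learn_bpe_merges(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Represent each word as a tuple of symbols, starting from single characters.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the corpus.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken by the pair itself
        # so the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the chosen pair.
        merged_words = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_words[tuple(out)] += freq
        words = merged_words
    return merges

# Hypothetical miniature corpus; the real tokenizer trains on the downloaded text files.
corpus = ["low", "low", "lower", "lowest", "new", "newer"]
merges = learn_bpe_merges(corpus, num_merges=3)
print(merges)
```

Each learned merge becomes a new vocabulary entry, so frequent fragments such as `low` end up as single tokens while rare words are still representable as sequences of smaller pieces.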

In [None]:
# download raw corpus data
!wget -c https://media.githubusercontent.com/media/wrobbins0409/cse30124-project/main/data/tokenizer_data/en_part_1.txt
!wget -c https://media.githubusercontent.com/media/wrobbins0409/cse30124-project/main/data/tokenizer_data/en_part_2.txt
!wget -c https://media.githubusercontent.com/media/wrobbins0409/cse30124-project/main/data/tokenizer_data/en_part_3.txt

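# Collect only the downloaded corpus files; a recursive "**/*.txt" glob would also
# pick up any unrelated .txt files under the working directory
paths = sorted(str(x) for x in Path(".").glob("en_part_*.txt"))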

# Initialize Byte Level Tokenizer
tokenizer = ByteLevelBPETokenizer()

# Customize training
tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])

# Make model directory
!mkdir wrobbins0409-english-sentiment

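# Save the trained vocab and merges files. With "." as the directory and a name prefix,
# they are written to the current directory as wrobbins0409-english-sentiment-vocab.json
# and wrobbins0409-english-sentiment-merges.txt, not into the folder created above.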
tokenizer.save_model(".", "wrobbins0409-english-sentiment")

--2023-11-15 07:08:48--  https://media.githubusercontent.com/media/wrobbins0409/cse30124-project/main/data/tokenizer_data/en_part_1.txt
Resolving media.githubusercontent.com (media.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to media.githubusercontent.com (media.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1000001091 (954M) [text/plain]
Saving to: ‘en_part_1.txt’


2023-11-15 07:09:07 (138 MB/s) - ‘en_part_1.txt’ saved [1000001091/1000001091]

--2023-11-15 07:09:08--  https://media.githubusercontent.com/media/wrobbins0409/cse30124-project/main/data/tokenizer_data/en_part_2.txt
Resolving media.githubusercontent.com (media.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to media.githubusercontent.com (media.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1000000661 (