# Applied Machine Learning Homework 4
Due 12/15/21 11:59PM EST

### Q1: Natural Language Processing

We will train a supervised training model to predict if a tweet has a positive or negative sentiment.

#### Dataset loading & dev/test splits

1.1) Load the twitter dataset from NLTK library

In [1]:
import nltk
nltk.download('twitter_samples')
from nltk.corpus import twitter_samples 

[nltk_data] Downloading package twitter_samples to
[nltk_data]     /Users/shuumatahou/nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!


1.2) Load the positive & negative tweets

In [2]:
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

1.3) Create a development & test split (80/20 ratio):

In [3]:
#code here

from sklearn.model_selection import train_test_split

all_tweets = all_positive_tweets + all_negative_tweets

labels = [1] * len(all_positive_tweets) + [0] * len(all_negative_tweets)

dev_tweets, test_tweets, dev_labels, test_labels = train_test_split(
    all_tweets, labels, test_size=0.2, random_state=42
)

#### Data preprocessing

We will do some data preprocessing before we tokenize the data. We will remove `#` symbol, hyperlinks, stop words & punctuations from the data. You can use the `re` package in python to find and replace these strings. 

1.4) Replace the `#` symbol with '' in every tweet

In [4]:
#code here
import re

pattern = re.compile(r'#')

clean_tweets = [re.sub(pattern, '', tweet) for tweet in all_tweets]


1.5) Replace hyperlinks with '' in every tweet

In [5]:
#code here
pattern = re.compile(r'http\S+|www\S+')

clean_tweets = [re.sub(pattern, '', tweet) for tweet in clean_tweets]


1.6) Remove all stop words

In [6]:
#code here
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

clean_tweets = [[word for word in tweet.split() if word.lower() not in stop_words] for tweet in clean_tweets]


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/shuumatahou/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


1.7) Remove all punctuations

In [7]:
#code here
import string

pattern = re.compile(r'[{}]'.format(re.escape(string.punctuation)))

clean_tweets = [[re.sub(pattern, '', word) for word in tweet] for tweet in clean_tweets]


1.8) Apply stemming on the development & test datasets using Porter algorithm

In [8]:
#code here
from nltk.stem import PorterStemmer

porter = PorterStemmer()

stemmed_dev_tweets = [[porter.stem(word) for word in tweet] for tweet in dev_tweets]
stemmed_test_tweets = [[porter.stem(word) for word in tweet] for tweet in test_tweets]


#### Model training

1.9) Create bag of words features for each tweet in the development dataset

In [9]:
#code here
from sklearn.feature_extraction.text import CountVectorizer

# Create a CountVectorizer object with the token_pattern parameter
vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')

# Fit the vectorizer on the stemmed development tweets
vectorizer.fit_transform([' '.join(tweet) for tweet in stemmed_dev_tweets])

# Create bag of words features for the stemmed development tweets
bow_dev_features = vectorizer.transform([' '.join(tweet) for tweet in stemmed_dev_tweets])


1.10) Train a supervised learning model of choice on the development dataset

In [10]:
#code here
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()

clf.fit(bow_dev_features, dev_labels)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


1.11) Create TF-IDF features for each tweet in the development dataset

In [11]:
#code here
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(token_pattern=r'\b\w+\b')

vectorizer.fit_transform([' '.join(tweet) for tweet in stemmed_dev_tweets])

tfidf_dev_features = vectorizer.transform([' '.join(tweet) for tweet in stemmed_dev_tweets])


1.12) Train the same supervised learning algorithm on the development dataset with TF-IDF features

In [12]:
#code here

clf_1 = LogisticRegression()

clf_1.fit(tfidf_dev_features, dev_labels)


1.13) Compare the performance of the two models on the test dataset

In [13]:
#code here
bow_test_features = vectorizer.transform([' '.join(tweet) for tweet in stemmed_test_tweets])

tfidf_test_features = vectorizer.transform([' '.join(tweet) for tweet in stemmed_test_tweets])

bow_test_score = clf.score(bow_test_features, test_labels)
print("Bag of words model accuracy:", bow_test_score)

tfidf_test_score = clf_1.score(tfidf_test_features, test_labels)
print("TF-IDF model accuracy:", tfidf_test_score)



Bag of words model accuracy: 0.494
TF-IDF model accuracy: 0.6285
