# PSY 341K Text Analysis for Behavioral Data Science
##### Spring 2024; written by: Prof Desmond Ong (desmond.ong@utexas.edu)

## Assignment 2

In this assignment we'll be processing a dataset using an NLP pipeline to extract linguistic features (n-grams), and then using these features to predict an outcome of interest.

In Assignment 1, we walked you through each step of the 'research' process. In Assignment 2, we'll guide you through the high-level goals, but you'll have a bit more latitude to decide how to go about each step in the process. (You have all the "mechanics" in terms of the code required, from the Tutorials). This assignment will be more challenging because it is a bit more open-ended, and the idea is to gradually build you towards executing your research project, which is the other extreme where you decide everything (and it does get pretty overwhelming!).

In [2]:
import nltk
nltk.download('twitter_samples')

[nltk_data] Downloading package twitter_samples to
[nltk_data]     /Users/ruthcarter/nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!


True

In [24]:
import numpy as np
from nltk.corpus import twitter_samples
import pandas as pd

The data we are using is a small sample of tweets that is packaged into `nltk.corpus`, and they are labelled as either "positive" or "negative".


### Your goal in this Assignment is to investigate what linguistic features predict a tweet being "Positive" or "Negative". 

In other words, you will calculate some linguistic features of interest, e.g., unigrams, bigrams, if you like, trigrams. You may also calculate other features, e.g., word count.

You will then use those features in a logistic regression (or other classification technique of your choosing) to predict the label of the tweets.

### Reading in the data

The data consists of 5,000 positive and 5,000 negative tweets.

In [4]:
positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')

print(len(positive_tweets), len(negative_tweets))

all_tweets = positive_tweets + negative_tweets

5000 5000


In [5]:
# print out the first ten positive tweets, and the first ten negative tweets, to get a sense of the text.

# --- your code ---
print(positive_tweets[:10])

print(negative_tweets[:10])


['#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)', '@Lamb2ja Hey James! How odd :/ Please call our Contact Centre on 02392441234 and we will be able to assist you :) Many thanks!', '@DespiteOfficial we had a listen last night :) As You Bleed is an amazing track. When are you in Scotland?!', '@97sides CONGRATS :)', 'yeaaaah yippppy!!!  my accnt verified rqst has succeed got a blue tick mark on my fb profile :) in 15 days', '@BhaktisBanter @PallaviRuhail This one is irresistible :)\n#FlipkartFashionFriday http://t.co/EbZ0L2VENM', "We don't like to keep our lovely customers waiting for long! We hope you enjoy! Happy Friday! - LWWF :) https://t.co/smyYriipxI", '@Impatientraider On second thought, there’s just not enough time for a DD :) But new shorts entering system. Sheep must be buying.', 'Jgh , but we have to go to Bayan :D bye', 'As an act of mischievousness, am calling the ETL layer of our in-house warehousing app Katam

#### Comment on your observations! What do you notice? Are there things that you have to take note of?

#### Your Written Answer here
- The positive tweets use a lot of smiley faces and exclamation marks. They also seem to interact with other users more often. There are also abbreviations of words (like tmr for tomorrow or fb for facebook, or ETL for... something...) that we need to consider. Otherwise, we won't fully understand the meaning of the tweet. Also, since hashtags are usually one whole word, it will be difficult to analyze (like #FlipkartFashionFriday). Also, not many tweets in this dataset seem to use emojis, but it looks like it does occasionally happen. I think this means we should use the "from nltk.tokenize.casual import TweetTokenizer" instead of normally tokenizing the text. Additionally, since tweets are popular in many countries, we may need to account for different spellings of English words (like neighbour and neighbor). 

### Calculating Features

Let's process the tweets! In class, we've covered a number of different preprocessing steps to calculate linguistic features. The choice of which steps to use (or not use) really depends on the specific context.

- For example, we talked about why stop words are removed, but also why it may be interesting to keep stop words.
    - 
- As another example, we talked about identifing Named Entities. But what do you do with them? You could decide to keep them in as features if you have specific hypotheses (e.g., if you're studying some political text, it might be handy to keep in the names of certain politicians). Or you might decide that actually names are irrelevant to your research question and remove them.
    - might not be relevant, since we're just looking at whether tweets are positive or negative. there's not really any specific entities that would aid in whether tweets are pos or neg. 

The key is to really take some time to understand your data, and especially as it pertains to your hypotheses. As in Assignment 1, please `print()` and read some of the examples to get a sense for the language used. Please also `print()` out your variables as you are calculating them. Then you might notice additional issues that you may need to correct. 

- A simple one that we didn't cover in class (because it's quite straightforward) is lower-case normalization: that is, converting all the text to lowercase, say using `.lower()`. This is so `A strawberry` and `a strawberry` will become the same bigram. BUT lower-casing will also make `American` into `american`. (Also, lower-casing will throw off the POS-taggers/NER identifiers, which are case sensitive).


#### Note: Creating a feature array

Note that after you preprocess and calculate your unigrams/n-grams, you need to convert the features into a large word-count array, with a corresponding "vocabulary". 

For example, we can take all the unigrams and arrange them alphabetically:

- ["American", "and", ...]

and if the first tweet is "this American is a proud American" (so 2 `American`s and 0 `and`s), and the second tweet is "and I am happy" (0 `American` and 1 `and`), then we need to create a feature array that looks like:

- [[2, 0, ...]
- [0, 1, ...]
- ... ]

such that the rows give the features for each tweet, while the columns give the word-count of each n-gram in the vocabulary. This `num_tweet` by `num_feature` array can then be used as the independent variables ("X") in the regression. 


## Please process the text and calculate unigram and bigram features for each tweet.

Please add as many code boxes as you need, and document your steps (e.g., with markdown chunks or with in-line comments).


In [6]:
#### Your code below

In [7]:
#tokenize the tweets: 
from nltk.tokenize.casual import TweetTokenizer
ttokenizer = TweetTokenizer() # here we have to create a TweetTokenizer object

positive_tweets_list = [] 
negative_tweets_list = [] 


#tokenizing pos tweets
for i in positive_tweets: 
    j= i.lower()
    tokenized = ttokenizer.tokenize(j)
    positive_tweets_list.append(tokenized)
    
print(positive_tweets_list[0:5])



#tokenizing neg tweets
for i in negative_tweets: 
    j= i.lower()
    tokenized = ttokenizer.tokenize(j)
    negative_tweets_list.append(tokenized)
    
print(negative_tweets_list[0:5])

[['#followfriday', '@france_inte', '@pkuchly57', '@milipol_paris', 'for', 'being', 'top', 'engaged', 'members', 'in', 'my', 'community', 'this', 'week', ':)'], ['@lamb2ja', 'hey', 'james', '!', 'how', 'odd', ':/', 'please', 'call', 'our', 'contact', 'centre', 'on', '02392441234', 'and', 'we', 'will', 'be', 'able', 'to', 'assist', 'you', ':)', 'many', 'thanks', '!'], ['@despiteofficial', 'we', 'had', 'a', 'listen', 'last', 'night', ':)', 'as', 'you', 'bleed', 'is', 'an', 'amazing', 'track', '.', 'when', 'are', 'you', 'in', 'scotland', '?', '!'], ['@97sides', 'congrats', ':)'], ['yeaaaah', 'yippppy', '!', '!', '!', 'my', 'accnt', 'verified', 'rqst', 'has', 'succeed', 'got', 'a', 'blue', 'tick', 'mark', 'on', 'my', 'fb', 'profile', ':)', 'in', '15', 'days']]
[['hopeless', 'for', 'tmr', ':('], ['everything', 'in', 'the', 'kids', 'section', 'of', 'ikea', 'is', 'so', 'cute', '.', 'shame', "i'm", 'nearly', '19', 'in', '2', 'months', ':('], ['@hegelbon', 'that', 'heart', 'sliding', 'into', 'th

In [64]:
#calculating unigram features 


#unigram feat for pos tweets: 
unigram_pos_freq = []

for i in positive_tweets_list: 
    unigram_frequency_distribution = nltk.FreqDist(i)
    unigram_pos_freq.append(unigram_frequency_distribution)

unigram_pos = [tuple(entry.items()) for entry in unigram_pos_freq]

#unigram feat for neg tweets: 
unigram_neg_freq = []

for i in negative_tweets_list: 
    unigram_frequency_distribution = nltk.FreqDist(i)
    unigram_neg_freq.append(unigram_frequency_distribution)

unigram_neg = [tuple(entry.items()) for entry in unigram_neg_freq]


all_unigrams = unigram_pos + unigram_neg

all_tweets = positive_tweets_list + negative_tweets_list

In [28]:
len(all_unigrams)

10000

In [63]:
#calculating unigram feature array: 

flattened_list = [item for sublist in all_unigrams for item in sublist]

unigrams_df = pd.DataFrame(flattened_list)

unique_unigrams = pd.Series(unigrams_df[0].sort_values().unique())





0                       !
1                       "
2                       #
3                 ##bbmme
4        ##segalakatakata
               ...       
21732                   🚮
21733                   🚲
21734                   󾆖
21735                   󾌴
21736                   󾰀
Length: 21737, dtype: object

In [71]:
import re
from collections import Counter

In [78]:
word_counts = Counter()
unigram_list = []

for tweet_tokens in all_tweets:
    # Iterate over each word in the array of words to check
    for word in unique_unigrams:
        # Count the occurrences of the word in the tweet tokens and update the Counter
        word_counts[word] += tweet_tokens.count(word)


In [100]:
unigram_counts_per_tweet = []

# Iterate over each tweet
for tweet in all_tweets:
    # Initialize a Counter for the current tweet
    word_counts = Counter()
    
    # Iterate over each unique unigram
    for word in unique_unigrams:
        # Count the occurrences of the word in the tweet tokens and update the Counter
        word_counts[word] = tweet_tokens.count(word)
    
    # Append the Counter to the list
    unigram_counts_per_tweet.append(word_counts)
    
values_list = [count for counter in unigram_counts_per_tweet for count in counter.values()]


In [99]:
unigram_counts_per_tweet[0]

Counter({'!': 0,
         '"': 0,
         '#': 0,
         '##bbmme': 0,
         '##segalakatakata': 0,
         '#100reasonstovisitmombasa': 0,
         '#1948': 0,
         '#1tbps4': 0,
         '#2fm': 0,
         '#39': 0,
         '#4thstreetmusic': 0,
         '#50notifications': 0,
         '#5minute': 0,
         '#actuallythough': 0,
         '#addme': 0,
         '#addmeonbbm': 0,
         '#addmeonsnapchat': 0,
         '#admin_myung': 0,
         '#aflblueshawks': 0,
         '#agnezmo': 0,
         '#airfields': 0,
         '#airforce': 0,
         '#airport': 0,
         '#akshaymostlovedsuperstarever': 0,
         '#akua': 0,
         '#al_master_band': 0,
         '#alberta': 0,
         '#aldub': 0,
         '#alevel': 0,
         '#alienthought': 0,
         '#allgoodthingske': 0,
         '#aluminiumfree': 0,
         '#alwayskeepfighting': 0,
         '#am': 0,
         '#amateur': 0,
         '#amazon': 0,
         '#aminormalyet': 0,
         '#amnotness': 0,
 

In [86]:

uni_unique_counts = pd.Series(unigram_counts_per_tweet)

In [91]:
all_tweets = pd.Series(all_tweets)

In [92]:
#form the dataframe 
tweet_df = pd.concat([all_tweets, uni_unique_counts], axis=1)


In [96]:
tweet_df

Unnamed: 0,0,1
0,"[#followfriday, @france_inte, @pkuchly57, @mil...","{'!': 0, '""': 0, '#': 0, '##bbmme': 0, '##sega..."
1,"[@lamb2ja, hey, james, !, how, odd, :/, please...","{'!': 0, '""': 0, '#': 0, '##bbmme': 0, '##sega..."
2,"[@despiteofficial, we, had, a, listen, last, n...","{'!': 0, '""': 0, '#': 0, '##bbmme': 0, '##sega..."
3,"[@97sides, congrats, :)]","{'!': 0, '""': 0, '#': 0, '##bbmme': 0, '##sega..."
4,"[yeaaaah, yippppy, !, !, !, my, accnt, verifie...","{'!': 0, '""': 0, '#': 0, '##bbmme': 0, '##sega..."
...,...,...
9995,"[i, wanna, change, my, avi, but, usanele, :(]","{'!': 0, '""': 0, '#': 0, '##bbmme': 0, '##sega..."
9996,"[my, puppy, broke, her, foot, :(]","{'!': 0, '""': 0, '#': 0, '##bbmme': 0, '##sega..."
9997,"[where's, all, the, jaebum, baby, pictures, :(...","{'!': 0, '""': 0, '#': 0, '##bbmme': 0, '##sega..."
9998,"[but, but, mr, ahmad, maslan, cooks, too, :(, ...","{'!': 0, '""': 0, '#': 0, '##bbmme': 0, '##sega..."


In [81]:
word_counts

Counter({'!': 2675,
         '"': 479,
         '#': 22,
         '##bbmme': 2,
         '##segalakatakata': 1,
         '#100reasonstovisitmombasa': 1,
         '#1948': 1,
         '#1tbps4': 1,
         '#2fm': 1,
         '#39': 1,
         '#4thstreetmusic': 1,
         '#50notifications': 1,
         '#5minute': 1,
         '#actuallythough': 1,
         '#addme': 2,
         '#addmeonbbm': 1,
         '#addmeonsnapchat': 3,
         '#admin_myung': 1,
         '#aflblueshawks': 2,
         '#agnezmo': 1,
         '#airfields': 1,
         '#airforce': 1,
         '#airport': 1,
         '#akshaymostlovedsuperstarever': 1,
         '#akua': 2,
         '#al_master_band': 1,
         '#alberta': 1,
         '#aldub': 3,
         '#alevel': 1,
         '#alienthought': 1,
         '#allgoodthingske': 1,
         '#aluminiumfree': 1,
         '#alwayskeepfighting': 1,
         '#am': 2,
         '#amateur': 5,
         '#amazon': 4,
         '#aminormalyet': 1,
         '#amnotness'

In [58]:
unigrams_df

Unnamed: 0,0,1
0,#followfriday,1
1,@france_inte,1
2,@pkuchly57,1
3,@milipol_paris,1
4,for,1
...,...,...
121042,expecting,1
121043,misserable,1
121044,few,1
121045,weeks,1


In [25]:
#calculating bigram features: 

#for pos tweets: 
bigram_pos_freq = [] 

for entry in positive_tweets_list:
    bigram_pos_zip = []
    text_for_pos_bigrams = nltk.ngrams(entry, 2)
    bigram_pos_zip.append(text_for_pos_bigrams)
    for zip_text in bigram_pos_zip: 
        bigram_frequency_distribution = nltk.FreqDist(zip_text)
        bigram_pos_freq.append(bigram_frequency_distribution)
bigram_pos = [tuple(entry.items()) for entry in bigram_pos_freq]


#for neg tweets: 
bigram_neg_freq = [] 

for entry in negative_tweets_list:
    bigram_neg_zip = []
    text_for_neg_bigrams = nltk.ngrams(entry, 2)
    bigram_neg_zip.append(text_for_neg_bigrams)
    for zip_text in bigram_neg_zip: 
        bigram_frequency_distribution = nltk.FreqDist(zip_text)
        bigram_neg_freq.append(bigram_frequency_distribution)
bigram_neg = [tuple(entry.items()) for entry in bigram_neg_freq]






### Creating a Train/Test set Split

Now that we're done calculating features and are ready to move onto making predictions, let's split up the examples into a training set and a test set, in order to avoid overfitting.

Please split up the data with **80% in the training set** and the remaining **20% in the test set**. 

In [None]:
# you can do a simple version (e.g., put the first 4000 
# positive examples into a training set)
# or you can also choose a random split.

# be sure to create labels too!
# Let's use 1 = pos, 0 = neg

# --- your code ---











### Using the linguistic features to make predictions

Now that you have calculated the linguistic features for each tweet, and you have divided your data into a training set and a test set, let's take stock of the main variables you should have. (Note: *num_features* may differ depending on the choices you made to calculate your features.)

You should have:

- a 8000 x *num_features* array that contains the features for the 8000 tweets in the training set
- a 2000 x *num_features* array that contains the features for the 2000 tweets in the test set
- a 8000 x 1 array that contains the labels (pos/neg) for the 8000 tweets in the training set
- a 2000 x 1 array that contains the labels (pos/neg) for the 2000 tweets in the test set

- an array that contains information on what each of the features mean (i.e., for the unigram/bigram features, this refers to the "vocabulary". You'll need this to interpret the results.)


If you've made it thus far, great, you're almost there! The remaining bit of work is to (i) train a model on the training set, (ii) evaluate the classification accuracy on the test set, and (iii) evaluate the features that are predictive of the label (and discuss).

In [None]:
# (i) Train a model on the training set

# please set up and train a logistic regression model on the training set.
# if your number of features is much larger than the training set size, you may wish to consider using regularization


# NOTE if you are using statsmodel, this may take a while to train.
# sklearn does it much faster. 

# --- your code ---



In [None]:
# (ii) Evaluate the classification accuracy on the test set

# Using your model, make label predictions on the test set (by using the model on the test features).
# compare them against the actual test set labels.
# what is the classification accuracy of this model?

# --- your code ---


In [None]:
# (iii) Evaluate the features that are predictive of the label (and discuss).

# Take a look at the features that are most predictive of the label. 
# For example, which unigrams or bigrams were most predictive?
# do these make sense?

# --- your code, and written text response ---


