# Analyzing the Huskers vs. Buckeyes Game via Tweets

Now that I have a nice, cleaned data set of Tweets from the Nebraska vs Ohio State game on September 28th, let's do some basic data analysis and separate the Tweets into state-based data sets!

*Note*: this is a work in progress and may be updated later with projects like building n-grams from common words and using tf-idf to determine which words are most strongly associated with a Nebraska vs. Ohio state of origin.

In [25]:
# import libraries

import pandas as pd
import re
import collections
import itertools
from nltk.corpus import stopwords

#load the data set
tweets = pd.read_csv("Documents/Husker Project/cleanedGameTweets.csv")

#and get some basic info about it
print(tweets.columns)
print(tweets.shape)


Index(['Unnamed: 0', 'Username', 'Tweet ID', 'Time', 'User Location', 'Text',
       'Tweet Geo Coordinates', 'Is Retweet', 'Is Quote Tweet', 'Sentiment'],
      dtype='object')
(13190, 10)


One small thing: I neglected to get rid of "stop words" (common words that don't add much to the text analysis, like "an," "the," "that," etc.). Using the nltk list of common English stop words, I'm going to remove them from my Tweets.

In [26]:
# what the tweets look like now
type(tweets['Text'])
print(tweets['Text'])

0               ['ohio', 'state', 'over', 'team', 'total']
1                 ['go', 'big', 'red', '#huskers', '#gbr']
2        ['brock', 'osweiler', 'at', 'today', 's', '#hu...
3        ['shewrap', 'ohio', 'state', 'when', 'they', '...
4                           ['let', 's', 'go', 'buckeyes']
5        ['i', 'think', 'tonight', 'is', 'the', 'night'...
6        ['hours', 'later', 'off', 'hours', 'of', 'slee...
7                                        ['old', 'friend']
8        ['clemson', 'survives', 'unc', 'but', 'what', ...
9        ['no', 'ohio', 'state', 'marching', 'to', 'vic...
10                              ['love', 'my', 'buckeyes']
11       ['lets', 'go', 'bucks', '#gobucks', '#buckeyen...
12       ['do', 'you', 'really', 'think', 'that', 'as',...
13       ['go', 'jeremy', 'ruckert', 'and', 'ohio', 'st...
14       ['it', 'was', 'tough', 'especially', 'with', '...
15       ['i', 'hope', 'nebraska', 'wins', 'in', 'my', ...
16       ['old', 'dominion', 'played', 'horrible', 'st'.

In [27]:
# seeing what the pre-loaded stopwords look like
stopWords = stopwords.words('english')
print(stopWords)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [28]:
# each entry in 'Text' is the string form of a list (it was read back from the CSV),
# which is most likely why a plain [word for word...] wasn't working: parse it first
import ast

def removeStopWords(list1):
    """parses the string form of a list into an actual list, then removes stop words"""
    actualList = ast.literal_eval(list1)
    cleanList = [word for word in actualList if word not in stopWords]
    return cleanList


tweets['Text'] = tweets['Text'].apply(removeStopWords)

# checking that it worked
print(tweets['Text'])

0                               [ohio, state, team, total]
1                           [go, big, red, #huskers, #gbr]
2                 [brock, osweiler, today, #huskers, game]
3        [shewrap, ohio, state, win, aldi, kroger, bowl...
4                                      [let, go, buckeyes]
5        [think, tonight, night, adrian, martinez, thro...
6        [hours, later, hours, sleep, stop, us, cheerin...
7                                            [old, friend]
8        [clemson, survives, unc, game, notre, dame, dr...
9        [ohio, state, marching, victory, go, buckeyes,...
10                                        [love, buckeyes]
11       [lets, go, bucks, #gobucks, #buckeyenation, #t...
12       [really, think, husker, fan, unaware, game, go...
13       [go, jeremy, ruckert, ohio, state, beat, nebra...
14       [tough, especially, ending, clemson, game, sav...
15       [hope, nebraska, wins, head, delusional, enoug...
16       [old, dominion, played, horrible, st, half, oh.
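The reason `ast.literal_eval` is needed can be seen on a toy example. This is a minimal sketch, not the notebook's actual data: `rawText` and `demoStopWords` are made-up stand-ins (the real stop words come from nltk), and using a set for the stop words is my suggestion for faster lookups rather than something the code above does.

```python
import ast

# Hypothetical stand-ins: one CSV cell as it is read back, and a tiny stop-word set
rawText = "['let', 's', 'go', 'buckeyes']"
demoStopWords = {"i", "s", "the", "at"}

# Iterating over the raw string yields single characters, not words
firstChars = [ch for ch in rawText][:3]

# literal_eval safely parses the string into a real Python list
parsed = ast.literal_eval(rawText)

# Set membership is O(1) on average, versus scanning a list for every word
cleaned = [word for word in parsed if word not in demoStopWords]

print(firstChars)
print(cleaned)
```

With the real data, the same idea would be `stopWords = set(stopwords.words('english'))` before calling `removeStopWords`.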

Next, removing the words that I used to collect the tweets in the first place (note: I split the string "ohio state" that I had originally searched for into two words and removed both):

In [29]:
collectionWords = ['husker','huskers','cornhuskers','gbr','ohio','state','buckeyes',
                    'gobucks','osuvsneb','#husker','#huskers','#cornhuskers','#gbr',
                    '#ohio','#state','#buckeyes','#gobucks','#osuvsneb']

#defining a function to remove collection words
def removeCollectionWords(list1):
    """removes the collection (search) words from a list of words"""
    cleanList = [word for word in list1 if word not in collectionWords]
    return cleanList

#applying it
tweets['Text'] = tweets['Text'].apply(removeCollectionWords)

#checking that it worked
print(tweets['Text'])

0                                            [team, total]
1                                           [go, big, red]
2                           [brock, osweiler, today, game]
3                 [shewrap, win, aldi, kroger, bowl, year]
4                                                [let, go]
5        [think, tonight, night, adrian, martinez, thro...
6        [hours, later, hours, sleep, stop, us, cheerin...
7                                            [old, friend]
8        [clemson, survives, unc, game, notre, dame, dr...
9           [marching, victory, go, link, buy, #toughlove]
10                                                  [love]
11                [lets, go, bucks, #buckeyenation, #tosu]
12       [really, think, fan, unaware, game, go, stomac...
13                   [go, jeremy, ruckert, beat, nebraska]
14       [tough, especially, ending, clemson, game, sav...
15       [hope, nebraska, wins, head, delusional, enoug...
16       [old, dominion, played, horrible, st, half, ne.
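A caveat when typing out long keyword lists like `collectionWords`: Python silently concatenates adjacent string literals, so a comma that lands inside the quotes merges two entries into one. The sketch below uses made-up list names to show the effect, along with a quick length and membership check that would catch it.

```python
# Hypothetical lists for illustration only
withTypo = ['#ohio,''#state', '#buckeyes']    # the comma sits inside the first string
asIntended = ['#ohio', '#state', '#buckeyes']

print(withTypo)        # the first two entries have fused into one string
print(len(withTypo), len(asIntended))

# Neither '#ohio' nor '#state' is actually a member of the typo'd list
print('#ohio' in withTypo, '#state' in withTypo)
```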

In [30]:
# making the tweets into one big list of words
allWords = list(itertools.chain(*tweets['Text']))

# Create counter
countsWords = collections.Counter(allWords)

# saving the top 25 words into a dataframe
topWords = pd.DataFrame(countsWords.most_common(25),
                             columns=['words', 'count'])
print(topWords)

       words  count
0   nebraska   1979
1       game   1392
2       team   1190
3       good    932
4       like    734
5         go    609
6        get    584
7   football    571
8       year    566
9    tonight    478
10      best    472
11      play    463
12    fields    455
13      time    450
14     first    444
15      fans    443
16      half    435
17     right    427
18   clemson    421
19     going    407
20       one    403
21       big    376
22       let    376
23     still    375
24    really    371
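To make the flatten-then-count step concrete, here is a small sketch on made-up token lists (`demoTweets` is hypothetical, not rows from the data set). It uses `itertools.chain.from_iterable`, which does the same flattening as `chain(*...)` without unpacking every row into a separate function argument.

```python
import collections
import itertools

# Hypothetical token lists standing in for the 'Text' column
demoTweets = [
    ["go", "big", "red"],
    ["let", "go"],
    ["big", "game", "go"],
]

# Flatten the list of lists into one long list of words
demoWords = list(itertools.chain.from_iterable(demoTweets))

# Count how often each word appears and keep the two most frequent
demoCounts = collections.Counter(demoWords)
print(demoCounts.most_common(2))
```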


Exporting the top 25 words so I can graph them in Tableau:

In [31]:
topWords.to_csv('Documents/Husker Project/mostCommonWords.csv')

Let's look at the most common Nebraskan (that is, from a user-reported location of Nebraska) words!

In [32]:
nebTweets = tweets[tweets['User Location'] == 'nebraska']
nebTweets.head()
nebTweets.shape

(2245, 10)

In [33]:
#repeating the same process to get the most common NE words
nebWords = list(itertools.chain(*nebTweets['Text']))
nebCountsWords = collections.Counter(nebWords)

# and saving to a dataframe
nebTopWords = pd.DataFrame(nebCountsWords.most_common(25),
                             columns=['words', 'count'])

Exporting the top 25 Nebraska words so I can graph them in Tableau:

In [34]:
nebTopWords.to_csv('Documents/Husker Project/nebraskaWords.csv')
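The flatten, count, and tabulate steps used for Nebraska are repeated for Ohio next, so they could be folded into one helper. This is a sketch under assumptions: `topWordsFor` is a name I made up, and it accepts any iterable of token lists, so a filtered column such as `nebTweets['Text']` could be passed in directly.

```python
import collections
import itertools

def topWordsFor(tokenLists, n=25):
    """Return the n most common words across an iterable of token lists as (word, count) pairs."""
    counts = collections.Counter(itertools.chain.from_iterable(tokenLists))
    return counts.most_common(n)

# Toy input standing in for a 'Text' column
demoOhio = [["go", "bucks"], ["fields", "touchdown"], ["go", "fields", "go"]]
print(topWordsFor(demoOhio, n=2))
```

The result can be wrapped the same way as before, for example `pd.DataFrame(topWordsFor(nebTweets['Text']), columns=['words', 'count'])`.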

And the most common Ohioan words!

In [35]:
# same process--getting just those tweets from users in Ohio, then putting all of the words together and aggregating word counts!
ohioTweets = tweets[tweets['User Location'] == 'ohio']
ohioTweets.head()
ohioTweets.shape

ohioWords = list(itertools.chain(*ohioTweets['Text']))
ohioCountsWords = collections.Counter(ohioWords)
ohioCountsWords.most_common(150)

#and to a dataframe...

ohioTopWords = pd.DataFrame(ohioCountsWords.most_common(25),
                             columns=['words', 'count'])

In [36]:
ohioTopWords.to_csv('Documents/Husker Project/ohioWords.csv')
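As a first step toward the n-gram idea mentioned in the note at the top, bigrams can be counted with the same `Counter` approach. This is only a sketch on made-up token lists: `bigrams` is a hypothetical helper, and pairs are formed within each tweet so that words from different tweets are never joined.

```python
import collections

def bigrams(tokens):
    """Return adjacent word pairs from one tweet's token list."""
    return list(zip(tokens, tokens[1:]))

# Hypothetical token lists standing in for the 'Text' column
demoTweets = [
    ["go", "big", "red"],
    ["go", "big", "game"],
    ["old", "friend"],
]

# Count every within-tweet pair across all tweets
bigramCounts = collections.Counter(
    pair for tokens in demoTweets for pair in bigrams(tokens)
)
print(bigramCounts.most_common(1))
```

One thing to keep in mind: because stop words have already been removed, some pairs will join words that were not adjacent in the original tweet.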