# Introduction to Artificial Intelligence Final Project
## By Noah Segal-Gould and Tanner Cohan

### To implement:
[K-Means clustering, hierarchical document clustering, and topic modeling](http://brandonrose.org/clustering)

[K-Means clustering](http://scikit-learn.org/dev/auto_examples/text/plot_document_clustering.html)

#### Import libraries

In [1]:
import pandas as pd
from glob import glob
from os.path import basename, splitext

#### Create lists of file names for all Twitter account CSVs

In [2]:
house_accounts_filenames = glob("house/*.csv")

In [3]:
senate_accounts_filenames = glob("senate/*.csv")

#### Create lists of all dataframes for all CSVs

In [4]:
house_accounts_dataframes = [pd.read_csv(filename).assign(account="@" + splitext(basename(filename))[0]) 
                             for filename in house_accounts_filenames]

In [5]:
senate_accounts_dataframes = [pd.read_csv(filename).assign(account="@" + splitext(basename(filename))[0])
                              for filename in senate_accounts_filenames]
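The `account` column above is derived purely from each CSV's file name. As a minimal sketch of that step in isolation, the helper below (the name `handle_from_path` and the example path are hypothetical, not part of the project) strips the directory and extension and prepends `@`:

```python
from os.path import basename, splitext


def handle_from_path(path):
    """Return '@' followed by the file name without directory or extension."""
    return "@" + splitext(basename(path))[0]


# "house/ExampleRep.csv" is a made-up path used only for illustration.
example_handle = handle_from_path("house/ExampleRep.csv")
print(example_handle)  # @ExampleRep
```

This assumes every CSV is named after the account it contains, which is what the list comprehensions above rely on.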

#### Find the most Retweeted and the most Favorited Tweet for each account

In [6]:
most_retweets_house_accounts_dataframes = [df.loc[[df['Retweets'].idxmax()]] for df in house_accounts_dataframes]

In [7]:
most_favorites_house_accounts_dataframes = [df.loc[[df['Favorites'].idxmax()]] for df in house_accounts_dataframes]

In [8]:
most_retweets_senate_accounts_dataframes = [df.loc[[df['Retweets'].idxmax()]] for df in senate_accounts_dataframes]

In [9]:
most_favorites_senate_accounts_dataframes = [df.loc[[df['Favorites'].idxmax()]] for df in senate_accounts_dataframes]
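`idxmax` returns an index label, not a row position. With the default integer index that `read_csv` produces the two coincide, but they diverge as soon as the index is anything else. The toy dataframe below (invented data, with a deliberately non-default index) is a small sketch of why a label-based lookup with `.loc` is the safer way to fetch the winning row:

```python
import pandas as pd

# Invented data; the index labels 10, 20, 30 are not row positions.
toy = pd.DataFrame(
    {"Text": ["a", "b", "c"], "Retweets": [5, 42, 7]},
    index=[10, 20, 30],
)

top_label = toy["Retweets"].idxmax()  # label of the row with the largest value
top_row = toy.loc[[top_label]]        # double brackets keep a one-row dataframe

print(top_label)                # 20
print(top_row["Text"].iloc[0])  # b
```

Passing the same label to `.iloc` here would raise an `IndexError`, because position 20 does not exist in a three-row dataframe.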

#### Create dataframes of the most Retweeted and Favorited Tweets for each account

In [10]:
most_retweets_congress_dataframe = pd.concat(most_retweets_house_accounts_dataframes + most_retweets_senate_accounts_dataframes)

In [11]:
most_favorites_congress_dataframe = pd.concat(most_favorites_house_accounts_dataframes + most_favorites_senate_accounts_dataframes)

#### Show the Retweets dataframe

In [12]:
most_retweets_congress_dataframe.sort_values('Retweets').tail()

| Unnamed: 0 | Text | Date | Favorites | Retweets | Tweet ID | account |
|---|---|---|---|---|---|---|
| 38 | 56 years ago today I was released from Parchma... | 2017-07-07 13:17:53 | 259935 | 114910 | 883314124863700995 | @repjohnlewis |
| 1535 | Retweet if you care about @realDonaldTrump's t... | 2017-01-11 16:40:05 | 45713 | 123593 | 819222357214658561 | @RonWyden |
| 1220 | Hey Republicans, don't worry, that burn is cov... | 2017-03-24 19:53:43 | 310324 | 143726 | 845363015222542336 | @SenatorMenendez |
| 31 | It's a shame the White House has become an adu... | 2017-10-08 15:13:43 | 419380 | 148639 | 917045348820049920 | @SenBobCorker |
| 1421 | President Trump, you made a big mistake. By tr... | 2017-01-21 22:15:24 | 972101 | 452896 | 822930622926745602 | @SenSanders |


#### Show the Favorites dataframe

In [13]:
most_favorites_congress_dataframe.sort_values('Favorites').tail()

| Unnamed: 0 | Text | Date | Favorites | Retweets | Tweet ID | account |
|---|---|---|---|---|---|---|
| 333 | The President of the United States just defend... | 2017-08-15 21:03:44 | 221573 | 82791 | 897564488475586560 | @SenWarren |
| 38 | 56 years ago today I was released from Parchma... | 2017-07-07 13:17:53 | 259935 | 114910 | 883314124863700995 | @repjohnlewis |
| 1220 | Hey Republicans, don't worry, that burn is cov... | 2017-03-24 19:53:43 | 310324 | 143726 | 845363015222542336 | @SenatorMenendez |
| 31 | It's a shame the White House has become an adu... | 2017-10-08 15:13:43 | 419380 | 148639 | 917045348820049920 | @SenBobCorker |
| 1421 | President Trump, you made a big mistake. By tr... | 2017-01-21 22:15:24 | 972101 | 452896 | 822930622926745602 | @SenSanders |


#### Combine all House accounts and all Senate accounts, then merge both into a single Congress dataframe

In [14]:
house_dataframe = pd.concat(house_accounts_dataframes)

In [15]:
senate_dataframe = pd.concat(senate_accounts_dataframes)

In [16]:
congress_dataframe = pd.concat([house_dataframe, senate_dataframe]).reset_index(drop=True)
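`pd.concat` keeps each input's original row labels, so stacking per-account dataframes produces repeated labels unless the index is rebuilt. The sketch below uses two invented dataframes (the handles are placeholders) to show what `reset_index(drop=True)` changes:

```python
import pandas as pd

# Two invented per-account dataframes, each with its own 0-based index.
first = pd.DataFrame({"Text": ["a", "b"], "account": ["@ExampleOne", "@ExampleOne"]})
second = pd.DataFrame({"Text": ["c"], "account": ["@ExampleTwo"]})

stacked = pd.concat([first, second])         # labels carried over from the inputs
renumbered = stacked.reset_index(drop=True)  # fresh labels, old ones discarded

print(list(stacked.index))     # [0, 1, 0]
print(list(renumbered.index))  # [0, 1, 2]
```

A unique index matters for any later label-based lookup on the combined dataframe, since a repeated label would match more than one row.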

#### Print some statistics

In [17]:
print(f"Total number of Tweets for all accounts: {len(congress_dataframe)}")
print(f"Total number of accounts: {congress_dataframe['account'].nunique()}")
print(f"Total number of house members: {house_dataframe['account'].nunique()}")
print(f"Total number of senators: {senate_dataframe['account'].nunique()}")

Total number of Tweets for all accounts: 1614749
Total number of accounts: 524
Total number of house members: 424
Total number of senators: 100
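The counts above add up (424 + 100 = 524), which is what one would expect if no handle appears in both the `house` and `senate` folders. A minimal sketch of checking that directly is shown below; the function name `shared_accounts` and the sample handles are hypothetical, and in the notebook the arguments would be `house_dataframe["account"]` and `senate_dataframe["account"]`:

```python
def shared_accounts(house_accounts, senate_accounts):
    """Return the sorted handles that occur in both collections."""
    return sorted(set(house_accounts) & set(senate_accounts))


# Invented handles used only for illustration.
house_sample = ["@ExampleRepA", "@ExampleRepB", "@ExampleRepA"]
senate_sample = ["@ExampleSenA"]

overlap = shared_accounts(house_sample, senate_sample)
print(overlap)  # []
```

An empty result means the two chambers' account sets are disjoint, so the combined count is simply the sum of the two.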
