# Social Media Mining: Putting it all Together
### Vincent Malic - Spring 2018

## Module 6.3. Putting It all Together
* Demonstration of the KDD process from start to finish, but in miniature.
* Use this notebook to familiarize yourself with how all the steps fit together.
* Note: you will do more in-depth research for your final project.

### Basic idea: Research Question 
* Can we train a model to distinguish Tweets sent from Android and from iPhone on a single Twitter account?
* The hypothesis is that the language in Tweets from the two sources differs enough (statistically) for a model to tell them apart.
* Following Robinson's analysis, the goal is to detect two distinct writers on the Twitter account, separating the signal from the noise.

### Use Tweepy to get the 1000 most recent Tweets from the @realDonaldTrump Twitter account.
* While iterating, collect only the Tweet text and the Tweet source.

In [1]:
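import tweepy

# Fill in the credentials for your own Twitter app.
API_KEY = ""
API_SECRET = ""

# App-only authentication; the client waits when the rate limit is reached.
auth = tweepy.AppAuthHandler(API_KEY, API_SECRET)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

# Cursor pages through the user's timeline.
c = tweepy.Cursor(api.user_timeline, id="realDonaldTrump")

# Keep only the text and source of each Tweet.
tweet_data = []
for tweet in c.items(1000):
    tweet_data.append([tweet.text, tweet.source])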

In [2]:
tweet_data[:2]

[['RT @WhiteHouse: Merit-based immigration reform will benefit American workers and relieve the strain imposed by our current system on Federa…',
  'Twitter for iPhone'],
 ['RT @VollrathTammie: @realDonaldTrump @FoxNews https://t.co/sERi7Vyh5I',
  'Twitter for iPhone']]

I'm going to convert the data into a Pandas data frame for ease of data manipulation. 

In [3]:
import pandas as pd

df = pd.DataFrame(tweet_data, columns=["text", "source"])
df.head()

| | text | source |
|---|---|---|
| 0 | RT @WhiteHouse: Merit-based immigration reform... | Twitter for iPhone |
| 1 | RT @VollrathTammie: @realDonaldTrump @FoxNews ... | Twitter for iPhone |
| 2 | RT @FoxNews: President @realDonaldTrump on Dem... | Twitter for iPhone |
| 3 | RT @FoxNews: President @realDonaldTrump on DAC... | Twitter for iPhone |
| 4 | I will be interviewed by @JudgeJeanine on @Fox... | Twitter for iPhone |


## Use ``groupby`` feature to get aggregate statistics about Tweet source.

In [4]:
source_group = df.groupby("source")

In [5]:
source_group["source"].agg("count")

source
Media Studio           44
Twitter Web Client     33
Twitter for iPad       29
Twitter for iPhone    894
Name: source, dtype: int64
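
A shorter route to the same counts is `value_counts`, which can also report each source as a share of the whole. The following is a minimal sketch on a small made-up DataFrame; the rows are illustrative placeholders, not the collected Tweets.

```python
import pandas as pd

# Toy data standing in for the collected Tweets (hypothetical rows).
toy = pd.DataFrame({
    "text": ["a", "b", "c", "d", "e"],
    "source": ["Twitter for iPhone", "Twitter for iPhone", "Twitter for iPhone",
               "Twitter Web Client", "Media Studio"],
})

# Count Tweets per source, most frequent first.
counts = toy["source"].value_counts()

# Same breakdown as fractions of the whole, useful for spotting class imbalance.
shares = toy["source"].value_counts(normalize=True)

print(counts)
print(shares.round(2))
```

Looking at the shares early makes a heavy imbalance between sources visible before any model is trained.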

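### Only want Tweets from Twitter Web Client or iPhone 
* This sample contains no Android Tweets, so Twitter Web Client serves as the second source.
* Filter out everything not from either of these sources.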

In [6]:
df = df[df["source"] != "Media Studio"]
df = df[df["source"] != "Twitter for iPad"]

In [7]:
df = df[df['source'].isin(["Twitter Web Client", "Twitter for iPhone"])]

In [8]:
df.shape

(927, 2)

In [9]:
df.head()

| | text | source |
|---|---|---|
| 0 | RT @WhiteHouse: Merit-based immigration reform... | Twitter for iPhone |
| 1 | RT @VollrathTammie: @realDonaldTrump @FoxNews ... | Twitter for iPhone |
| 2 | RT @FoxNews: President @realDonaldTrump on Dem... | Twitter for iPhone |
| 3 | RT @FoxNews: President @realDonaldTrump on DAC... | Twitter for iPhone |
| 4 | I will be interviewed by @JudgeJeanine on @Fox... | Twitter for iPhone |


### Of the original 1000 datapoints, 927 remain.

## Use an SVM
* Represent each Tweet using the vector space model, weighted by TF-IDF.
* Sklearn does the heavy lifting for us.

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [11]:
tv = TfidfVectorizer()

In [12]:
X = tv.fit_transform(df['text'])

In [13]:
X.shape

(927, 3897)

### There are 3897 unique words in this dataset. 
* Each Tweet is now represented by a vector with 3897 elements in it. 
* Each element corresponds to a word and holds that word's TF-IDF weight in the Tweet.
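
To make the weighting concrete, here is a minimal standard-library sketch of TF-IDF on three made-up Tweets. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is how sklearn documents its default `TfidfVectorizer` settings; the toy texts and the helper name `tfidf_vector` are illustrative only.

```python
import math
import re
from collections import Counter

docs = ["great rally tonight", "great jobs numbers", "fake news tonight"]  # toy Tweets

# Tokenize: lowercase, keep tokens of two or more word characters.
tokenized = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]
vocab = sorted({tok for toks in tokenized for tok in toks})

n_docs = len(docs)
# Document frequency: number of documents that contain each term.
doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
# Smoothed inverse document frequency (assumed formula, see lead-in).
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf_vector(tokens):
    """Return one L2-normalized TF-IDF row aligned with `vocab`."""
    tf = Counter(tokens)
    raw = [tf[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm if norm else 0.0 for w in raw]

X_toy = [tfidf_vector(toks) for toks in tokenized]

# Rows are documents, columns are vocabulary terms.
print(len(X_toy), len(vocab))
print(dict(zip(vocab, (round(w, 3) for w in X_toy[0]))))
```

In the first toy Tweet, "rally" appears in only one document and therefore receives a higher weight than "great", which appears in two.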

## Create Test-Train split 
* X is a sparse matrix, which sklearn's splitting utilities and most of its estimators (including the SVM used below) accept as input. 
* Create a train-test split.

In [14]:
y = df['source']
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=3057)

In [15]:
X_train.shape

(741, 3897)

In [16]:
X_test.shape

(186, 3897)
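
With sources as imbalanced as these, a plain random split can leave very few minority-class examples in the test set. One option is the `stratify` parameter of `train_test_split`, which keeps the class ratio the same in both partitions. A minimal sketch on made-up labels (the 90/10 split and the placeholder features are assumptions for illustration):

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy labels with a 90/10 imbalance (hypothetical, not the notebook's data).
labels = ["Twitter for iPhone"] * 90 + ["Twitter Web Client"] * 10
features = [[i] for i in range(len(labels))]  # placeholder feature rows

# stratify=labels preserves the class proportions in train and test.
F_train, F_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.2, random_state=0, stratify=labels
)

print(Counter(l_train))
print(Counter(l_test))
```

In the notebook itself, the equivalent change would be passing `stratify=y` to the existing `train_test_split` call.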

## Fit model to Training set and evaluate model performance
* Use the classifier's fit() and predict() methods and sklearn's accuracy_score() function.

In [17]:
from sklearn.svm import SVC
clf = SVC(kernel="linear")
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

from sklearn.metrics import accuracy_score, confusion_matrix

print(accuracy_score(y_test, y_pred))

0.94623655914


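### Model accuracy is 94.62%
* A uniform random guess between the two classes would attain an accuracy of about 50%, but with classes this imbalanced the majority-class baseline is the fairer comparison. 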

In [18]:
from collections import Counter
print(Counter(y_test))

Counter({'Twitter for iPhone': 176, 'Twitter Web Client': 10})


In [19]:
round(176/(176+10), 4)

0.9462

### Dummy classifier that guesses majority label (iPhone) would get accuracy of about 94.6%. 
* The SVM's accuracy of 94.62% is identical to this baseline, so on this split the model does no better than always guessing iPhone.
* These results alone do not show that Tweets from the two sources are written by different people; the class imbalance (176 vs. 10) has to be addressed first.
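
The majority-class baseline can be computed directly from the label counts rather than typed in by hand. A minimal sketch, using the `Counter(y_test)` output shown above as its input:

```python
from collections import Counter

# Label counts copied from the Counter(y_test) output shown earlier.
test_counts = Counter({"Twitter for iPhone": 176, "Twitter Web Client": 10})

# The most common label and how often it occurs.
majority_label, majority_count = test_counts.most_common(1)[0]

# Accuracy of a classifier that always predicts the majority label.
baseline_accuracy = majority_count / sum(test_counts.values())

print(majority_label, round(baseline_accuracy, 4))
```

In a live notebook, `Counter(y_test)` could replace the hard-coded counts; sklearn's `DummyClassifier(strategy="most_frequent")` is another way to obtain the same baseline.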

In [20]:
confusion_matrix(y_test, y_pred)

array([[  0,  10],
       [  0, 176]])

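Rows are true labels and columns are predicted labels, in sorted order (Twitter Web Client, then Twitter for iPhone).

Of the 10 Twitter Web Client tweets, none were classified correctly; all 10 were labeled as iPhone.

Of the 176 iPhone tweets, all 176 were classified correctly.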
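
Per-class recall makes this pattern easier to see than overall accuracy does. A minimal sketch that derives both from the confusion matrix printed above, assuming rows and columns follow sklearn's sorted label order:

```python
# Confusion matrix from the cell above: rows are true labels, columns are
# predicted labels, in sorted label order (assumed to match sklearn's output).
labels = ["Twitter Web Client", "Twitter for iPhone"]
cm = [[0, 10],
      [0, 176]]

total = sum(sum(row) for row in cm)

# Accuracy: correct predictions (the diagonal) over all predictions.
accuracy = sum(cm[i][i] for i in range(len(cm))) / total

# Recall per class: correct predictions divided by the true count of that class.
recall = {label: cm[i][i] / sum(cm[i]) for i, label in enumerate(labels)}

print(round(accuracy, 4))
for label, value in recall.items():
    print(label, value)
```

A recall of zero for one class alongside high overall accuracy is a typical sign that the model is only predicting the majority label.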

# Important Note
* This shows you the KDD process from start to finish: idea (research question), data collection, data transformation, fitting, evaluating, and interpreting the model. 
* Provides a blueprint for how to move forward with your final project.

## Final project will be much more extensive: 
1. Your final project should be **grounded by relevant background research**. Provide context to what you are trying to do and position your research with respect to other studies. 
2. **Include more features to build a more powerful model**.
3. **Try more than one model** as appropriate for your data; include other models (e.g., Naive Bayes and Logistic Regression) to attain better accuracies. 
4. **Work on optimizing hyperparameters** (e.g., hyperparameter C). 
5. **Interpret and explain what your results mean**. If you get a model that works well, explain why you believe this is the case and what it means for your domain of interest. If you are unable to get a model that works well, you should try to find out why things didn't work out.
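
Points 3 and 4 above can be combined in one loop: wrap the vectorizer and each candidate model in a `Pipeline`, then let `GridSearchCV` search the hyperparameters by cross-validation. The following is only a sketch under stated assumptions: the corpus and labels are made up, the grids are arbitrary starting values, and `cv=2` is chosen only because the toy data is tiny.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Tiny made-up corpus so the sketch runs quickly (not the notebook's data).
texts = [
    "great rally tonight", "huge crowd tonight", "great crowd great rally",
    "big rally huge crowd", "meeting schedule announced", "official statement released",
    "schedule for official meeting", "statement on the meeting",
]
labels = ["a"] * 4 + ["b"] * 4

# Each candidate pairs a model with the hyperparameter grid to search.
candidates = {
    "svm": (SVC(kernel="linear"), {"clf__C": [0.1, 1, 10]}),
    "logreg": (LogisticRegression(max_iter=1000), {"clf__C": [0.1, 1, 10]}),
    "naive_bayes": (MultinomialNB(), {"clf__alpha": [0.1, 1.0]}),
}

results = {}
for name, (model, grid) in candidates.items():
    # Keeping TF-IDF inside the pipeline refits it on each training fold only.
    pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", model)])
    search = GridSearchCV(pipe, grid, cv=2, scoring="accuracy")
    search.fit(texts, labels)
    results[name] = (search.best_params_, search.best_score_)

for name, (params, score) in results.items():
    print(name, params, round(score, 3))
```

On imbalanced data, a metric other than plain accuracy (for example `scoring="balanced_accuracy"`) may give a more informative comparison between models.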