This is a demonstration of the KDD process from start to finish - but in miniature. Please use this notebook to familiarize yourself with how all the steps fit together, **but note that you will be expected to do more in-depth research for your final project**.

The basic idea here is to determine the extent to which Tweets from Android and Tweets from iPhone can be distinguished from each other. Our hypothesis, motivated by our first-week exploration of Robinson's analysis, is that there are two distinct writers on the Twitter account. If our hypothesis is true, we should be able to tell their writing apart.

I'm using Tweepy to get the 1000 most recent Tweets from Donald Trump's Twitter account. As I iterate, I am collecting only the Tweet text and Tweet source. 

In [1]:
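import tweepy

# Fill in your own Twitter application credentials.
API_KEY = ""
API_SECRET = ""

# Written against Tweepy 3.x. Later Tweepy releases changed parts of this API
# (for example, wait_on_rate_limit_notify was removed in 4.0).
auth = tweepy.AppAuthHandler(API_KEY, API_SECRET)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

# Page through the account's timeline; screen_name names the account explicitly.
c = tweepy.Cursor(api.user_timeline, screen_name="realDonaldTrump")

# Keep only the text and the source (client) of each Tweet.
tweet_data = []
for tweet in c.items(1000):
    tweet_data.append([tweet.text, tweet.source])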

In [2]:
tweet_data[:2]

[['Just leaving Florida. Big crowds of enthusiastic supporters lining the road that the FAKE NEWS media refuses to mention. Very dishonest!',
  'Twitter for Android'],
 ['Congratulations Stephen Miller- on representing me this morning on the various Sunday morning shows. Great job!',
  'Twitter for iPhone']]

I'm going to convert the data into a Pandas data frame for ease of data manipulation. 

In [3]:
import pandas as pd

df = pd.DataFrame(tweet_data, columns=["text", "source"])
df.head()

| | text | source |
|---|---|---|
| 0 | Just leaving Florida. Big crowds of enthusiast... | Twitter for Android |
| 1 | Congratulations Stephen Miller- on representin... | Twitter for iPhone |
| 2 | I know Mark Cuban well. He backed me big-time ... | Twitter for Android |
| 3 | After two days of very productive talks, Prime... | Twitter for Android |
| 4 | While on FAKE NEWS @CNN, Bernie Sanders was cu... | Twitter for Android |


I'll now use the ``groupby`` feature to get some aggregate statistics about the Tweet source.

In [4]:
source_group = df.groupby("source")

In [5]:
source_group["source"].agg("count")

source
Periscope                1
Twitter Ads              1
Twitter Web Client      75
Twitter for Android    437
Twitter for iPad         3
Twitter for iPhone     483
Name: source, dtype: int64

I only want to use Tweets from iPhone and Android, so I need to filter out everything else.

In [6]:
df = df[df['source'].isin(["Twitter for Android", "Twitter for iPhone"])]

In [7]:
df.shape

(920, 2)

In [8]:
df.head()

| | text | source |
|---|---|---|
| 0 | Just leaving Florida. Big crowds of enthusiast... | Twitter for Android |
| 1 | Congratulations Stephen Miller- on representin... | Twitter for iPhone |
| 2 | I know Mark Cuban well. He backed me big-time ... | Twitter for Android |
| 3 | After two days of very productive talks, Prime... | Twitter for Android |
| 4 | While on FAKE NEWS @CNN, Bernie Sanders was cu... | Twitter for Android |


Everything that's not iPhone or Android has been removed. Of the original 1000 datapoints, 920 remain.

I would like to use an SVM, and represent each Tweet using the vector space model, weighted by TF-IDF. As discussed in the previous lecture, sklearn does the heavy lifting for us.

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [10]:
tv = TfidfVectorizer()

In [11]:
X = tv.fit_transform(df['text'])

In [12]:
X.shape

(920, 3439)

There are 3439 unique terms in this dataset's vocabulary. Each Tweet is now represented by a vector with 3439 elements in it. Each element holds the TF-IDF weight of its corresponding term, and is zero when the term does not occur in the Tweet.
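To make the TF-IDF weighting concrete, here is a minimal hand-rolled sketch on a hypothetical three-document corpus (not the Tweet data). It assumes the smoothed IDF formula and L2 normalization that sklearn documents as the `TfidfVectorizer` defaults, and it tokenizes with a plain `split()`, which is simpler than sklearn's own tokenizer.

```python
import math
from collections import Counter

# Hypothetical toy corpus, for illustration only.
docs = [
    "fake news media",
    "great job media",
    "great great crowds",
]

# Simplified tokenization (assumption): whitespace split.
tokenized = [doc.split() for doc in docs]
vocab = sorted({word for doc in tokenized for word in doc})
n_docs = len(docs)

# Document frequency: number of documents that contain each term.
doc_freq = Counter(word for doc in tokenized for word in set(doc))

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}


def tfidf_vector(tokens):
    """Return the L2-normalized TF-IDF vector for one tokenized document."""
    counts = Counter(tokens)
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]


vectors = [tfidf_vector(doc) for doc in tokenized]

print(vocab)
for vec in vectors:
    print([round(v, 3) for v in vec])
```

Each document becomes a vector with one element per vocabulary term. In the first document, "media" appears in two of the three documents and so receives a lower weight than "fake", which appears in only one.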

X is a sparse matrix, which most sklearn estimators (including the SVM used here) accept as input. Let's make a train-test split.

In [20]:
y = df['source']
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=3057)

In [21]:
X_train.shape

(736, 3439)

In [22]:
X_test.shape

(184, 3439)

We've conducted our split. There are 736 training examples and 184 test examples.
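As a rough illustration of where the 736 and 184 come from, the sketch below performs a shuffled hold-out split on row indices using only the standard library. The `shuffle_split` helper is hypothetical and is not sklearn's implementation; it uses a different random number generator, so the rows it selects will not match those chosen by `train_test_split`.

```python
import math
import random


def shuffle_split(n_samples, test_size, seed):
    """Return (train_indices, test_indices) for a shuffled hold-out split."""
    # The test set takes the requested fraction, rounded up.
    n_test = math.ceil(test_size * n_samples)
    indices = list(range(n_samples))
    # A fixed seed makes the shuffle, and so the split, reproducible.
    random.Random(seed).shuffle(indices)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = shuffle_split(920, test_size=0.2, seed=3057)
print(len(train_idx), len(test_idx))  # 736 184
```

Fixing the seed plays the same role as `random_state` above: rerunning the notebook yields the same split, so results are comparable across runs.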

Let's train and evaluate our model. We've done this before many times, so the procedure should be quite familiar at this point. 

In [16]:
from sklearn.svm import SVC
clf = SVC(kernel="linear")
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

from sklearn.metrics import accuracy_score, confusion_matrix

print(accuracy_score(y_test, y_pred))

0.91847826087


About 92% is quite good. A baseline that guesses uniformly at random would attain an accuracy of about 50%.

In [17]:
from collections import Counter
print(Counter(y_test))

Counter({'Twitter for iPhone': 97, 'Twitter for Android': 87})


In [18]:
97/(97+87)

0.5271739130434783

A dummy classifier that always guesses the majority label (iPhone) would get an accuracy of about 53%. We can be fairly confident that the Android and iPhone Tweets are written by different people, as the Tweets differ enough in nature that an algorithm can separate them with about 92% accuracy.

In [19]:
confusion_matrix(y_test, y_pred)

array([[84,  3],
       [12, 85]])

Of the Android tweets, 84 were classified correctly and 3 incorrectly.

Of the iPhone tweets, 85 were classified correctly and 12 incorrectly.
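The counts above can be turned into accuracy and per-class recall and precision with a few lines of plain Python. This sketch hard-codes the confusion matrix shown above and assumes sklearn's convention that rows are true labels and columns are predicted labels, in sorted label order.

```python
labels = ["Twitter for Android", "Twitter for iPhone"]

# Rows: true label. Columns: predicted label. Values copied from the output above.
cm = [[84, 3],
      [12, 85]]

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(labels)))
accuracy = correct / total

recall = {}
precision = {}
for i, label in enumerate(labels):
    # Recall: of the Tweets truly from this source, the fraction found.
    recall[label] = cm[i][i] / sum(cm[i])
    # Precision: of the Tweets predicted as this source, the fraction correct.
    precision[label] = cm[i][i] / sum(row[i] for row in cm)
    print(f"{label}: recall={recall[label]:.3f}, precision={precision[label]:.3f}")

print(f"accuracy={accuracy:.3f}")
```

The recall values (84/87 for Android, 85/97 for iPhone) show that the errors are not symmetric: the model more often mistakes an iPhone Tweet for an Android Tweet than the reverse.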

# Important Note

This shows you the KDD process from start to finish: from idea, to data collection, to data transformation, to training a model, to evaluating and interpreting the model. Hopefully, this will give you a good blueprint for how to move forward with your final project.

However, your final project **must be more extensive** than this example. If you submit a project that does only what I've done above, I will have to deduct points.

1. Your final project should be better grounded in relevant background research. There should be a context for what you're attempting to do, and you should position your research with respect to other research.
2. In this scenario, I only took Tweet text into account. You should endeavour, to the extent possible, to include more features for a more powerful model.
3. You should probably try more than one model. I've used only a linear SVM here. If your data is appropriate, you should also try other models to see if you can attain better accuracies. I could have also used Naive Bayes and Logistic Regression in this scenario. 
4. You should work on optimizing hyperparameters. Here, for example, there is a hyperparameter C that I didn't touch. 
5. You should try to interpret and explain what your results mean. If you get a model that works well, explain why you believe this is the case and what it means for your domain of interest. If you are unable to get a model that works well, you should try to find out why things didn't work out.
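As one possible starting point for items 3 and 4, the sketch below compares the three models mentioned above with cross-validation and then searches over the SVM's `C`. The tiny corpus, its labels, and the grid of `C` values are hypothetical stand-ins chosen only so the example runs on its own; with real data you would run this on the training split and keep the test set held out for the final evaluation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy data, for illustration only (not the Tweet data).
texts = [
    "heavy rain expected tonight",
    "strong wind and rain tomorrow",
    "sunny morning then rain",
    "cold wind expected tomorrow",
    "rain and snow tonight",
    "sunny and cold morning",
    "home team wins the match",
    "late goal wins the game",
    "the team lost the match",
    "great goal in the game",
    "home team lost again",
    "match ends with late goal",
]
labels = ["weather"] * 6 + ["sports"] * 6

# Item 3: compare several models under the same cross-validation scheme.
candidates = {
    "linear SVM": SVC(kernel="linear"),
    "naive Bayes": MultinomialNB(),
    "logistic regression": LogisticRegression(max_iter=1000),
}
scores = {}
for name, model in candidates.items():
    # Putting the vectorizer in the pipeline refits it on each training fold,
    # so no vocabulary information leaks from the validation fold.
    pipeline = make_pipeline(TfidfVectorizer(), model)
    scores[name] = cross_val_score(pipeline, texts, labels, cv=3).mean()
    print(f"{name}: mean CV accuracy = {scores[name]:.3f}")

# Item 4: search over the SVM's regularization hyperparameter C.
search = GridSearchCV(
    make_pipeline(TfidfVectorizer(), SVC(kernel="linear")),
    param_grid={"svc__C": [0.01, 0.1, 1, 10, 100]},  # assumed grid
    cv=3,
)
search.fit(texts, labels)
print(search.best_params_, search.best_score_)
```

The step name `svc` in `svc__C` is the one `make_pipeline` assigns automatically from the lowercased class name.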