## Problem statement
Your task is to write a small Python or PySpark script that **predicts the total number of retweets a tweet will get** using only the provided dataset. This task is designed to test your Python or PySpark ability, your knowledge of Data Science techniques and your ability to work effectively, efficiently and independently within a commercial setting. This task is not designed to test your hypertuning abilities or lateral thinking. Think of this as iteration one or a proof of concept. Be critical of the data: it is raw, straight from the Twitter API.

Deliverables:

- One Python or PySpark script (a Jupyter notebook is preferable)
- One virtual environment requirements text file including an exhaustive list of packages and version numbers used in your solution (`pip freeze > requirements.txt`)
- Nothing else

Your solution should at a minimum do the following:

- Load the data into memory
- Prepare the data for modelling
- Build a model on training data
- Test the model on testing data
- Provide some measure of performance

Please include a list of references (commented in the Python or PySpark script) if you include code from external sources. We suggest spending no more than four hours on this task.

In [19]:
## loading data using pandas
import pandas as pd
import numpy as np
from pandas import ExcelFile
xls = ExcelFile('tweets.xlsx')
df = xls.parse(xls.sheet_names[0])
len(df)


42368

In [4]:
from dateutil import parser
def time_format(date_string):
    dt = parser.parse(date_string)
    return(dt.strftime("%Y-%m-%d %H:%M:%S"))

In [5]:
df.head()

Unnamed: 0,TweetPostedTime,TweetID,TweetBody,TweetRetweetFlag,TweetSource,TweetInReplyToStatusID,TweetInReplyToUserID,TweetInReplyToScreenName,TweetRetweetCount,TweetFavoritesCount,...,UserDescription,UserLink,UserExpandedLink,UserFollowersCount,UserFriendsCount,UserListedCount,UserSignupDate,UserTweetCount,MacroIterationNumber,tweet.place
0,Tue Dec 20 10:57:00 +0000 2016,811163485052817408,RT @BeachyMaldives: Local interaction is a gre...,True,"<a href=""http://twitter.com/download/iphone"" r...",,,,1,0,...,Pls donate 2 https://t.co/RvOUK9lAWI #YearEndG...,https://t.co/jghZVBsiQF,http://cjqenterprises.com,6334,6144,1917,Sun Jun 14 22:36:15 +0000 2015,33556,0,
1,Tue Dec 20 10:56:59 +0000 2016,811163483463122944,RT @TechTerraEd: Need #giftideas for your kid(...,True,"<a href=""http://twitter.com/download/iphone"" r...",,,,1,0,...,"Educator of students with special needs, Mothe...",,,154,371,180,Sat Jan 02 13:36:23 +0000 2010,3201,0,
2,Tue Dec 20 10:56:55 +0000 2016,811163466387988480,Seven Questions Before Choosing a Cruise Line ...,False,"<a href=""http://www.google.com/"" rel=""nofollow...",,,,0,0,...,Thrifty Mom Media social media consulting and ...,https://t.co/cEhGzaQJp6,http://www.thriftymommastips.com/,23433,24762,961,Tue May 26 21:26:09 +0000 2009,147958,0,
3,Tue Dec 20 10:56:55 +0000 2016,811163465125679104,"RT @CMGsportsclub: Yoga do Brasil, un havre de...",True,"<a href=""https://roundteam.co"" rel=""nofollow"">...",,,,1,0,...,"Adventure travel, yoga, paleo, Crossfit, runni...",https://t.co/3IHwXkgAkA,https://primalsanctuary.com,11136,10081,978,Sat Sep 12 20:29:18 +0000 2015,28988,0,
4,Tue Dec 20 10:56:53 +0000 2016,811163457508642817,"RT @StylishRentals: Love this! ""Palm Springs M...",True,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",,,,3065,0,...,I really have got giant ambitions. I start com...,,,55,21,31,Wed Sep 07 16:22:15 +0000 2016,19581,0,


In [6]:
df.tail()

Unnamed: 0,TweetPostedTime,TweetID,TweetBody,TweetRetweetFlag,TweetSource,TweetInReplyToStatusID,TweetInReplyToUserID,TweetInReplyToScreenName,TweetRetweetCount,TweetFavoritesCount,...,UserDescription,UserLink,UserExpandedLink,UserFollowersCount,UserFriendsCount,UserListedCount,UserSignupDate,UserTweetCount,MacroIterationNumber,tweet.place
42363,Tue Dec 20 00:25:13 +0000 2016,811004491378073600,#BusinessInsider Your Money #Travel The Bigges...,False,"<a href=""http://www.hootsuite.com"" rel=""nofoll...",,,,0,0,...,- Owner of - QB Sports and The Judge & Jury - ...,http://t.co/aUs1RvWTzE,http://www.bostonmanor.ca/,537,170,90,Mon Jan 11 00:36:46 +0000 2010,22170,449,
42364,Tue Dec 20 00:25:12 +0000 2016,811004490300223492,.@jessicaparsons @brokegirlsdiary #rockstar #D...,False,"<a href=""http://www.facebook.com/twitter"" rel=...",,,,0,0,...,Talent | Literary | Production,https://t.co/6g3HhXQBkh,https://pro-labs.imdb.com/company/co0499796/,2635,1870,321,Sat Sep 13 19:18:55 +0000 2014,15266,449,
42365,Tue Dec 20 00:25:12 +0000 2016,811004489813495808,"RT @StylishRentals: Love this! ""Dragonfly Dese...",True,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",,,,3043,0,...,"udemy instructor, author, marketeer, into tech...",,,91,54,83,Thu Sep 01 23:19:19 +0000 2016,23419,449,
42366,Tue Dec 20 00:25:12 +0000 2016,811004488932737024,"RT @StylishRentals: Love this! ""Dragonfly Dese...",True,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",,,,3043,0,...,Keep track of your cryptocurrencies and genera...,,,68,50,63,Fri Sep 02 17:16:12 +0000 2016,20737,449,
42367,Tue Dec 20 00:25:12 +0000 2016,811004487737360384,"RT @StylishRentals: Love this! ""Dragonfly Dese...",True,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",,,,3043,0,...,"ETHEREUM Foundation, project community/ecosys...",,,213,51,50,Fri Sep 02 16:52:21 +0000 2016,23982,449,


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42368 entries, 0 to 42367
Data columns (total 32 columns):
TweetPostedTime              42368 non-null object
TweetID                      42368 non-null int64
TweetBody                    42368 non-null object
TweetRetweetFlag             42368 non-null bool
TweetSource                  42368 non-null object
TweetInReplyToStatusID       101 non-null float64
TweetInReplyToUserID         189 non-null float64
TweetInReplyToScreenName     189 non-null object
TweetRetweetCount            42368 non-null int64
TweetFavoritesCount          42368 non-null int64
TweetHashtags                42268 non-null object
TweetPlaceID                 1000 non-null object
TweetPlaceName               1000 non-null object
TweetPlaceFullName           1000 non-null object
TweetCountry                 999 non-null object
TweetPlaceBoundingBox        1000 non-null object
TweetPlaceAttributes         0 non-null float64
TweetPlaceContainedWithin    0 non-null fl

In [8]:
df['TweetPostedTime'] = df['TweetPostedTime'].apply(lambda x:time_format(x))
df['TweetPostedTime'] = pd.to_datetime(df['TweetPostedTime'],format = "%Y-%m-%d %H:%M:%S")

In [9]:
print (min(df['TweetPostedTime']))
print(max(df['TweetPostedTime']))

2016-12-20 00:25:12
2016-12-20 10:57:00


The data covers part of a single day (about ten and a half hours), so I am not sure how to capture temporal variation across tweets. Data exploration should show whether `TweetRetweetCount` varies over time. Retweets whose original tweet is not in the dataset should generally be ignored, because the features of the original tweet are unknown when only the retweet is present in the data.
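One way to look for temporal variation within the day is to aggregate retweet counts by posting hour. The following is a minimal sketch on made-up toy rows, not the author's analysis; the retweet counts are invented for illustration, and the format string is an assumption based on the timestamp layout shown in `df.head()`.

```python
import pandas as pd

# Assumed layout of the raw timestamps, e.g. "Tue Dec 20 10:57:00 +0000 2016"
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"

# Toy rows: timestamps follow the dataset's layout, counts are invented
toy = pd.DataFrame({
    "TweetPostedTime": [
        "Tue Dec 20 00:25:12 +0000 2016",
        "Tue Dec 20 00:40:00 +0000 2016",
        "Tue Dec 20 10:57:00 +0000 2016",
    ],
    "TweetRetweetCount": [0, 4, 1],
})

# Parse the raw strings directly into timezone-aware datetimes
toy["TweetPostedTime"] = pd.to_datetime(
    toy["TweetPostedTime"], format=TWITTER_TIME_FORMAT, utc=True
)

# Number of tweets and mean retweet count per posting hour
hourly = (
    toy.groupby(toy["TweetPostedTime"].dt.hour)["TweetRetweetCount"]
    .agg(["count", "mean"])
)
print(hourly)
```

With roughly ten hours of data, such a table can only hint at within-day effects; it says nothing about day-of-week or longer trends.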

In [10]:
import re
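def rem_RT(text):
    """Remove the 'RT @user:' prefix from a tweet body so that retweets
    can be compared with their original tweets."""
    text = str(text)
    # Non-greedy match from 'RT' up to the first colon; DOTALL lets it span newlines.
    # Note: the space after the colon is kept, so the result starts with a space.
    rem_rt_text = re.sub(r'RT.*?:', '', text, flags=re.DOTALL)
    return rem_rt_text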
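
The pattern `RT.*?:` is not anchored, so it can also match the letters "RT" inside an ordinary word, and it leaves the space after the colon in place. A minimal sketch of an anchored alternative is shown below; `strip_rt_prefix` is a hypothetical helper, not part of the notebook, and the second sample string is invented to show the difference.

```python
import re

def strip_rt_prefix(text):
    """Hypothetical variant of rem_RT: remove only a leading 'RT @user:' prefix."""
    # Anchored at the start of the string; also consumes whitespace after the colon
    return re.sub(r'^RT @\w+:\s*', '', str(text))

retweet = "RT @BeachyMaldives: Local interaction is great"
plain = "SMART deal: now"  # invented example, not from the dataset

# Unanchored pattern from the notebook, for comparison
unanchored = [re.sub(r'RT.*?:', '', t, flags=re.DOTALL) for t in (retweet, plain)]
anchored = [strip_rt_prefix(t) for t in (retweet, plain)]

print(unanchored)  # [' Local interaction is great', 'SMA now']
print(anchored)    # ['Local interaction is great', 'SMART deal: now']
```

The leading space matters when the cleaned text is later used as a join key, because an exact match on `TweetBody` treats `' text'` and `'text'` as different values.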
    

In [11]:
## Use retweets to check whether retweet counts vary over time
## by comparing each tweet with its retweets.
## Retweets whose original tweet is not in the dataset are not useful for modelling,
## because the original author's follower count, friend count, etc. are unknown.
df_rt = df[df['TweetRetweetFlag']==True]
df_no_rt = df[df['TweetRetweetFlag']==False]

In [12]:
df_t = df.copy()
df_t['TweetBody'] = df_t['TweetBody'].apply(lambda x :rem_RT(x) )

In [13]:
# Work on an explicit copy so the assignment does not trigger SettingWithCopyWarning
df_rt = df_rt.copy()
df_rt['TweetBody'] = df_rt['TweetBody'].apply(rem_RT)


In [14]:
df_rt_tweet = df_rt[['TweetPostedTime', 'TweetID', 'TweetBody','TweetRetweetCount']]
rem_dup_df=df_rt_tweet.drop_duplicates(df_rt_tweet.columns)
len(rem_dup_df)

26862

In [15]:
df_no_rt_tweet = df_no_rt[['TweetPostedTime', 'TweetID', 'TweetBody','TweetRetweetCount']]

In [16]:
merged_df = pd.merge(df_no_rt_tweet, rem_dup_df, how='inner', on=['TweetBody'])

In [20]:
merged_df['flag'] = np.sign(merged_df.TweetRetweetCount_y - merged_df.TweetRetweetCount_x)

In [21]:
sum(merged_df['flag'])

0
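
A zero sum of signs is a fairly weak check: positive and negative differences cancel each other, and an empty merge also sums to zero. A minimal sketch of a stricter check, on made-up toy counts rather than the dataset, counts the rows whose two counts differ and reports how many rows were matched at all.

```python
import numpy as np
import pandas as pd

# Toy stand-in for merged_df; the counts are invented for illustration
toy_merged = pd.DataFrame({
    "TweetRetweetCount_x": [3, 5, 2],
    "TweetRetweetCount_y": [4, 4, 2],
})

diff = toy_merged.TweetRetweetCount_y - toy_merged.TweetRetweetCount_x

sign_sum = int(np.sign(diff).sum())   # +1 and -1 cancel out
n_changed = int((diff != 0).sum())    # rows where the two counts differ
n_matched = len(toy_merged)           # 0 here would mean the merge found nothing

print(sign_sum, n_changed, n_matched)
```

On the real data, `n_matched` and `n_changed` computed from `merged_df` would show whether the zero comes from identical counts, from cancellation, or from an empty join.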

In [22]:
# The flags sum to 0, which suggests that a tweet's retweet count does not differ
# between the original and its retweets in this snapshot.
# We are therefore interested only in original tweets, so all retweets are removed.
df_rem_rt = df[df['TweetRetweetFlag']==False]
len(df_rem_rt)

15506

In [23]:
## Checking duplicates if any
rem_dup_df=df_rem_rt.drop_duplicates(df_rem_rt.columns)
len(rem_dup_df)

15506

In [95]:
rem_dup_df.head()


Unnamed: 0,TweetPostedTime,TweetID,TweetBody,TweetRetweetFlag,TweetSource,TweetInReplyToStatusID,TweetInReplyToUserID,TweetInReplyToScreenName,TweetRetweetCount,TweetFavoritesCount,...,UserDescription,UserLink,UserExpandedLink,UserFollowersCount,UserFriendsCount,UserListedCount,UserSignupDate,UserTweetCount,MacroIterationNumber,tweet.place
2,2016-12-20 10:56:55,811163466387988480,Seven Questions Before Choosing a Cruise Line ...,False,"<a href=""http://www.google.com/"" rel=""nofollow...",,,,0,0,...,Thrifty Mom Media social media consulting and ...,https://t.co/cEhGzaQJp6,http://www.thriftymommastips.com/,23433,24762,961,Tue May 26 21:26:09 +0000 2009,147958,0,
9,2016-12-20 10:56:48,811163434775617536,"Images from #Paris. An apartment, dinner, musi...",False,"<a href=""http://www.softwareopal.com/"" rel=""no...",,,,0,0,...,Moderate drinking is one of the great pleasure...,http://t.co/bsNw9HDu4E,http://www.scoop.it/t/about-whiskey/,1380,1333,260,Wed Oct 02 21:15:56 +0000 2013,41506,0,
10,2016-12-20 10:56:48,811163433961930752,Looking for that perfect Christmas gift? Give ...,False,"<a href=""http://www.hootsuite.com"" rel=""nofoll...",,,,0,0,...,,http://t.co/TFxlNDGwWL,http://serenityoverload.blogspot.com,441,1258,20,Thu Apr 30 02:21:24 +0000 2009,5681,0,
11,2016-12-20 10:56:47,811163432724602888,in the news - East Africa: 12 Swahili words to...,False,"<a href=""https://about.twitter.com/products/tw...",,,,0,0,...,http://t.co/6fVHooKt - Southern African onlin...,http://t.co/9g9W3Q7361,http://www.travelcomments.com,6904,2763,643,Tue Feb 03 14:27:10 +0000 2009,80215,0,
16,2016-12-20 10:56:29,811163357436854272,My latest #blog post is all about #seminyak to...,False,"<a href=""http://twitter.com/download/iphone"" r...",,,,0,0,...,"Writer, Wellness and Luxury Travel Lifestyle B...",https://t.co/AGq5ifhd6N,http://saudidiva.com,2063,2252,348,Tue Jul 16 07:00:36 +0000 2013,8955,0,


# Steps

## Data preprocessing

## Data cleaning
- Clean the text and retain the features needed to train the model
- Look for any anomalies and address them if required

## Model training

## Model testing
- Test the best-performing model on the test set
- Look for bias or overfitting

## Feature expansion and engineering



In [24]:
rem_dup_df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 15506 entries, 2 to 42364
Data columns (total 32 columns):
TweetPostedTime              15506 non-null object
TweetID                      15506 non-null int64
TweetBody                    15506 non-null object
TweetRetweetFlag             15506 non-null bool
TweetSource                  15506 non-null object
TweetInReplyToStatusID       101 non-null float64
TweetInReplyToUserID         189 non-null float64
TweetInReplyToScreenName     189 non-null object
TweetRetweetCount            15506 non-null int64
TweetFavoritesCount          15506 non-null int64
TweetHashtags                15406 non-null object
TweetPlaceID                 997 non-null object
TweetPlaceName               997 non-null object
TweetPlaceFullName           997 non-null object
TweetCountry                 996 non-null object
TweetPlaceBoundingBox        997 non-null object
TweetPlaceAttributes         0 non-null float64
TweetPlaceContainedWithin    0 non-null float6

In [25]:
## Choose the columns to use as features,
## keeping those that published research suggests are important.
## A parsimonious model is less prone to overfitting.
model_df = rem_dup_df[['TweetRetweetCount','UserTweetCount','UserListedCount',
                       'UserFriendsCount','UserFollowersCount','TweetBody']].reset_index(drop=True)

In [26]:
model_df['TweetRetweetCount'].describe()

count    15506.000000
mean         2.116084
std         71.928530
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max       3309.000000
Name: TweetRetweetCount, dtype: float64

Most of the values are 0 and only a few tweets are popular, which is realistic, as not every tweet can be popular.
A plain linear regression model could be tried, but given the huge variation in the target (`TweetRetweetCount`) it may not be useful. Other types of regression model are worth considering, and quantile regression is probably a good choice. A metric such as root mean square error (RMSE) is popular for this kind of task, but on its own it is hard to justify here, since most of the targets are 0 or close to 0 and only a few are large. Instead, we can compute a modified RMSE separately on the values above the 95th percentile and on the values below it.
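
Quantile regression minimises the pinball (quantile) loss rather than squared error. The sketch below is a hand-written NumPy version for illustration, evaluated on made-up toy values; `pinball_loss` is a hypothetical helper and not part of the notebook.

```python
import numpy as np

def pinball_loss(y_true, y_pred, quantile):
    """Mean pinball loss: under-prediction is weighted by `quantile`,
    over-prediction by `1 - quantile`."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    diff = y_true - y_pred
    return float(np.mean(np.maximum(quantile * diff, (quantile - 1) * diff)))

# Toy target: mostly zeros with one popular tweet (invented values)
y_toy = [0, 0, 10]

# At quantile 0.95, missing the large value is penalised far more than
# over-predicting the zeros
loss_low = pinball_loss(y_toy, [1, 1, 1], 0.95)
loss_high = pinball_loss(y_toy, [10, 10, 10], 0.95)
print(loss_low, loss_high)
```

This asymmetry is why a model fitted at a high quantile tends to predict above most observed values, which in turn inflates its RMSE on the many zero-count tweets.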


In [29]:
## Split the data into train (80%) and test (20%) sets
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(model_df, test_size=0.2)

In [30]:
df_train=df_train.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

In [31]:
df_test.shape

(3102, 6)

In [32]:
# Create content features from the tweet body, as they can significantly influence retweet counts
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
# Keep only terms that appear in at least 3% of training tweets; reuse this fitted vectorizer on the test set
transform_tweet = CountVectorizer(min_df=0.03, stop_words='english')
x_train_tweet = transform_tweet.fit_transform(df_train['TweetBody'])
tfidf_trfm = TfidfTransformer()
X_tr_tfidf = tfidf_trfm.fit_transform(x_train_tweet)


In [33]:
X_tr_tfidf.shape

(12404, 18)
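
As a side note, scikit-learn documents `TfidfVectorizer` as equivalent to `CountVectorizer` followed by `TfidfTransformer`, so the two steps can be collapsed into one object that is fitted on the training text and reused on the test text. A minimal sketch on an invented toy corpus (not the tweets) is given below.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Invented toy corpus standing in for df_train['TweetBody'] / df_test['TweetBody']
toy_train = ["travel to japan", "christmas travel deals", "best holiday photography"]
toy_test = ["japan holiday travel"]

# Two-step version, as in the notebook
counts = CountVectorizer(stop_words="english")
tfidf = TfidfTransformer()
two_step_train = tfidf.fit_transform(counts.fit_transform(toy_train))
two_step_test = tfidf.transform(counts.transform(toy_test))

# One-step version: fit on train only, then transform test
one_step = TfidfVectorizer(stop_words="english")
one_step_train = one_step.fit_transform(toy_train)
one_step_test = one_step.transform(toy_test)

same_train = np.allclose(two_step_train.toarray(), one_step_train.toarray())
same_test = np.allclose(two_step_test.toarray(), one_step_test.toarray())
print(same_train, same_test)
```

Either way, the important point is that the vocabulary and IDF weights are learned from the training set only.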

In [34]:
x_test = transform_tweet.transform(df_test['TweetBody'])
X_te_tfidf = tfidf_trfm.transform(x_test)

In [35]:
X_te_tfidf.shape

(3102, 18)

In [36]:
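# Note: get_feature_names() was replaced by get_feature_names_out() in scikit-learn 1.0 and removed in 1.2
count_vect_df = pd.DataFrame(X_tr_tfidf.todense(), columns=transform_tweet.get_feature_names())
count_vect_df_test = pd.DataFrame(X_te_tfidf.todense(), columns=transform_tweet.get_feature_names())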

In [37]:
df_train= pd.concat([df_train.drop(['TweetBody'], axis=1), count_vect_df], axis=1)
df_test= pd.concat([df_test.drop(['TweetBody'], axis=1), count_vect_df_test], axis=1)

In [38]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12404 entries, 0 to 12403
Data columns (total 23 columns):
TweetRetweetCount     12404 non-null int64
UserTweetCount        12404 non-null int64
UserListedCount       12404 non-null int64
UserFriendsCount      12404 non-null int64
UserFollowersCount    12404 non-null int64
amp                   12404 non-null float64
best                  12404 non-null float64
christmas             12404 non-null float64
daily                 12404 non-null float64
holiday               12404 non-null float64
https                 12404 non-null float64
japan                 12404 non-null float64
japantravel           12404 non-null float64
latest                12404 non-null float64
new                   12404 non-null float64
photography           12404 non-null float64
rt                    12404 non-null float64
thanks                12404 non-null float64
tourism               12404 non-null float64
travel                12404 non-null float64
t

In [39]:
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
np.random.seed(1)

Y_train = df_train['TweetRetweetCount']
X_test = df_test.drop('TweetRetweetCount',axis=1)
Y_test = df_test['TweetRetweetCount']
X_train= df_train.drop('TweetRetweetCount',axis=1)

alpha = 0.95

clf = GradientBoostingRegressor(loss='quantile', alpha=alpha,
                                n_estimators=100, max_depth=10,
                                learning_rate=.01, min_samples_leaf=9,
                                min_samples_split=9)

clf.fit(X_train, Y_train)



GradientBoostingRegressor(alpha=0.95, criterion='friedman_mse', init=None,
             learning_rate=0.01, loss='quantile', max_depth=10,
             max_features=None, max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=9, min_samples_split=9,
             min_weight_fraction_leaf=0.0, n_estimators=100,
             presort='auto', random_state=None, subsample=1.0, verbose=0,
             warm_start=False)
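
With `loss='quantile'` and `alpha=0.95`, the regressor estimates the 95th conditional percentile rather than the mean, so its predictions are expected to lie above most observed values. The sketch below illustrates this on synthetic skewed data; the data, sizes and hyperparameters are assumptions chosen for a quick demonstration and do not reproduce the notebook's model.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)

# Synthetic features and a skewed, mostly small target (illustrative only)
X_toy = rng.uniform(0, 1, size=(500, 2))
y_toy = rng.exponential(scale=1.0 + 2.0 * X_toy[:, 0])

def fit_quantile(alpha):
    """Fit a small quantile gradient-boosting model for the given alpha."""
    model = GradientBoostingRegressor(
        loss="quantile", alpha=alpha,
        n_estimators=50, max_depth=2, random_state=0,
    )
    return model.fit(X_toy, y_toy)

# Fraction of training targets that fall at or below each model's prediction
coverage_95 = float(np.mean(y_toy <= fit_quantile(0.95).predict(X_toy)))
coverage_50 = float(np.mean(y_toy <= fit_quantile(0.50).predict(X_toy)))
print(coverage_95, coverage_50)
```

A point-error metric such as RMSE therefore penalises a 0.95-quantile model on the bulk of low-count tweets, which is worth keeping in mind when reading the scores further down.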

In [40]:
# Predictions from the trained model
y_pred = clf.predict(X_test)
# Baseline 1: predict zero for every tweet, since most retweet counts are zero
y_0 = np.zeros(len(Y_test.values), dtype=int)
# Baseline 2: predict the mean retweet count (computed here from the test labels)
y_mean = np.full(len(Y_test.values), np.mean(Y_test.values))

In [41]:
# Create a pandas DataFrame with the ground truth and all predictions
test_array = np.vstack((Y_test.values,y_pred,y_0,y_mean)).T
# Column names
columns_new = ['ground', 'prediction','prediction_0', 'prediction_mean']

# Pass in the array and the column names
pred_df = pd.DataFrame(test_array, columns=columns_new)

In [42]:
pred_df.head()

Unnamed: 0,ground,prediction,prediction_0,prediction_mean
0,0.0,0.848392,0.0,1.491618
1,0.0,5.793615,0.0,1.491618
2,0.0,0.896065,0.0,1.491618
3,0.0,1.402822,0.0,1.491618
4,0.0,1.236779,0.0,1.491618


In [43]:
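# Split at the 98th percentile of the true counts (the frame names say 95);
# rows exactly equal to the cutoff fall into neither frame
pred_df_95 = pred_df[pred_df.ground < np.percentile(pred_df.ground,98)]
pred_df_100 = pred_df[pred_df.ground > np.percentile(pred_df.ground,98)]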
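
Because the target has many tied zeros, the percentile cutoff can coincide with an observed value, and two strict comparisons then drop every row equal to it. A minimal sketch on invented toy counts shows the effect and one possible alternative that uses a boolean mask and its complement; the names `typical` and `popular` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy ground-truth counts with many ties at zero (invented values)
toy_pred = pd.DataFrame({"ground": [0, 0, 0, 0, 0, 0, 0, 0, 4, 9]})

cutoff = np.percentile(toy_pred["ground"], 70)

# Strict comparisons on both sides: rows equal to the cutoff are lost
strict_below = toy_pred[toy_pred["ground"] < cutoff]
strict_above = toy_pred[toy_pred["ground"] > cutoff]
dropped = len(toy_pred) - len(strict_below) - len(strict_above)

# Complementary masks: every row lands in exactly one segment
is_popular = toy_pred["ground"] > cutoff
popular = toy_pred[is_popular]
typical = toy_pred[~is_popular]

print(cutoff, dropped, len(typical), len(popular))
```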

In [44]:
pred_df_100

Unnamed: 0,ground,prediction,prediction_0,prediction_mean
5,7.0,3.415824,0.0,1.491618
25,19.0,2.73687,0.0,1.491618
59,5.0,5.695186,0.0,1.491618
97,27.0,0.9221,0.0,1.491618
120,15.0,17.642004,0.0,1.491618
339,22.0,17.642004,0.0,1.491618
350,15.0,5.701892,0.0,1.491618
384,8.0,7.461741,0.0,1.491618
421,24.0,18.372281,0.0,1.491618
459,57.0,10.816294,0.0,1.491618


RMSE of the model and of both baselines on the test rows below the percentile cutoff (`pred_df_95`):

In [47]:
from sklearn.metrics import mean_squared_error
import math
error_pred = math.sqrt(mean_squared_error(pred_df_95.ground,pred_df_95.prediction))
error_pred_0 = math.sqrt(mean_squared_error(pred_df_95.ground,pred_df_95.prediction_0))
error_pred_mean = math.sqrt(mean_squared_error(pred_df_95.ground,pred_df_95.prediction_mean))
print(error_pred)
print(error_pred_0)
print(error_pred_mean)

21.509702684321006
0.5065533595472461
1.4202509017353797


In [48]:
# RMSE of the model and of both baselines on the rows above the cutoff
error_pred = math.sqrt(mean_squared_error(pred_df_100.ground,pred_df_100.prediction))
error_pred_0 = math.sqrt(mean_squared_error(pred_df_100.ground,pred_df_100.prediction_0))
error_pred_mean = math.sqrt(mean_squared_error(pred_df_100.ground,pred_df_100.prediction_mean))
print(error_pred)
print(error_pred_0)
print(error_pred_mean)

151.7312279365989
437.5379013232335
437.27747835300863
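
The two metric cells repeat the same three lines for each segment. A small helper can compute RMSE for every prediction column at once. The sketch below is a hypothetical refactoring in plain NumPy; the sample values are the first three rows of `pred_df_100` shown earlier.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def rmse_by_column(segment, truth_col, pred_cols):
    """Return {prediction column: RMSE against truth_col} for one segment."""
    return {col: rmse(segment[truth_col], segment[col]) for col in pred_cols}

# First three rows of pred_df_100 from the output above
segment = {
    "ground": [7.0, 19.0, 5.0],
    "prediction": [3.415824, 2.73687, 5.695186],
    "prediction_0": [0.0, 0.0, 0.0],
}

scores = rmse_by_column(segment, "ground", ["prediction", "prediction_0"])
print(scores)
```

The same function accepts a pandas DataFrame such as `pred_df_95` or `pred_df_100`, since it only indexes columns by name.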


## Results and conclusion

- The quantile model does not look bad on popular tweets: above the percentile cutoff its RMSE (about 151.7) is well below that of the all-zero and mean baselines (about 437.5 and 437.3). Below the cutoff, however, it is much worse than either baseline (about 21.5 versus 0.51 and 1.42).
- More exploration could be done on features other than content features, or by deriving new features from the available ones, for example the time between the user's Twitter signup and the tweet's posting time, to capture how long the user has been on Twitter.
- More visualization with graphs would help understanding. I skipped graphs to save time, as the tables already give enough information.
- Models other than quantile regression, such as spline regression, could also be tried to check the accuracy. Hyperparameter tuning may also improve the quantile regression results.
- Code style can be improved for readability.
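
The account-age feature suggested in the list above could be derived roughly as follows. This is a minimal sketch under stated assumptions: it starts from the raw timestamp strings (in the notebook, `TweetPostedTime` has already been converted to datetimes), the format string is inferred from the values shown in `df.head()`, and `AccountAgeDays` is a hypothetical column name. The sample row uses the timestamps of the first row of the dataset.

```python
import pandas as pd

# Assumed layout of the raw timestamps, e.g. "Tue Dec 20 10:57:00 +0000 2016"
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"

# Timestamps taken from the first row shown in df.head()
sample = pd.DataFrame({
    "TweetPostedTime": ["Tue Dec 20 10:57:00 +0000 2016"],
    "UserSignupDate": ["Sun Jun 14 22:36:15 +0000 2015"],
})

posted = pd.to_datetime(sample["TweetPostedTime"], format=TWITTER_TIME_FORMAT, utc=True)
signup = pd.to_datetime(sample["UserSignupDate"], format=TWITTER_TIME_FORMAT, utc=True)

# Whole days between account creation and the tweet (hypothetical feature name)
sample["AccountAgeDays"] = (posted - signup).dt.days
print(sample["AccountAgeDays"].tolist())
```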

# References
https://twittercommunity.com/t/twitter-api-listed-count/32759
http://scikit-learn.org/stable/auto_examples/ensemble/plot_gradient_boosting_quantile.html
http://scikit-learn.org/stable/auto_examples/ensemble/plot_gradient_boosting_regression.html#sphx-glr-auto-examples-ensemble-plot-gradient-boosting-regression-py