# Data Manipulation in Python (CS2006 P2)

On 12 November 2014, the European Space Agency lander Philae made the first ever soft landing of a spacecraft on the surface of a comet, 67P/Churyumov-Gerasimenko, having been carried there by the probe Rosetta. News of the achievement spread across social media platforms, including Twitter, over the following days and weeks. This notebook analyses data independently gathered from Twitter to track the volume and nature of user activity related to the landing over the three weeks that followed.

In [66]:
#Imports
import pandas as pd
import matplotlib.pyplot as plt
import math
import numpy as np
import plotly
import plotly.plotly as py
import plotly.graph_objs as go
import operator
from collections import *
plotly.tools.set_credentials_file(username='ir47', api_key='k0qjUbd6owGLygMXJPZ5')

In [9]:
from wordcloud import WordCloud, STOPWORDS

ImportError: attempted relative import with no known parent package

In [2]:
df=pd.read_csv("../data/CometLanding.csv",encoding="UTF-8")

In [3]:
len(df)

77319

In [4]:
df.drop_duplicates(['id_str'],inplace = True)

In [5]:
numTweets = len(df)

The raw data contained duplicate tweets (rows sharing the same `id_str`), which were removed. The number of remaining unique tweets is displayed below:

In [6]:
print("Total number of Tweets: " + str(numTweets))

Total number of Tweets: 77268


In [7]:
df = df[df['text'].notnull()]

The number of unique users is displayed below:

In [62]:
len(df['from_user'].unique())

50195

The number of tweets per user language setting (`user_lang`) is displayed here. By far the most common is English (`en`), with 52316 tweets.

In [9]:
language = df.groupby('user_lang')

In [10]:
language.size()

user_lang
ar           428
bg             1
ca           309
cs            42
da            89
de          2916
el            29
en         52316
en-AU          1
en-GB         23
en-gb       1972
es          7540
es-MX          2
eu            62
fa             2
fi           108
fil           10
fr          3313
gl            36
he             2
hi             2
hu            41
id            66
it          2664
ja          1514
ko            98
msa            1
nb             1
nl           838
no            36
pl           157
pt           508
pt-PT          1
ro             8
ru           794
sv           126
th            57
tr           761
uk            43
ur             1
vi             1
xx-lc         24
zh-CN          6
zh-Hans        6
zh-cn        285
zh-tw         27
dtype: int64
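The table lists some language codes twice with different capitalisation (for example `en-GB` and `en-gb`). A minimal sketch of merging such variants is shown here, using a handful of counts copied from the table above. Lower-casing the codes is an assumption about how they should be grouped, not something the original analysis does.

```python
from collections import Counter

# Counts copied from a few rows of the user_lang table above
raw_counts = {"en": 52316, "en-GB": 23, "en-gb": 1972, "zh-CN": 6, "zh-cn": 285}

# Merge codes that differ only by capitalisation
merged = Counter()
for code, count in raw_counts.items():
    merged[code.lower()] += count

print(dict(merged))
```

On the full DataFrame, the same idea could be applied by lower-casing the `user_lang` column before grouping.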

The majority of tweets were actually retweets, which meant that they started with the letters "RT":

In [11]:
dfNoRT = df[~df.text.str.startswith('RT', na=False)]

In [12]:
numReTweets = numTweets - len(dfNoRT) 

In [13]:
print("Total number of retweets: " + str(numReTweets))

Total number of retweets: 59999
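Testing for a leading "RT" is a loose heuristic, since it also matches any tweet that merely begins with those two letters. The sketch below shows a stricter check, assuming the usual "RT @username" convention for retweets. The pattern, the helper name `is_retweet` and the sample tweets are illustrative and not taken from the data set.

```python
import re

# Assumed convention: a retweet starts with "RT @username"
RETWEET_RE = re.compile(r"^RT @\w+")

def is_retweet(text):
    """Return True if the text looks like an 'RT @user' retweet."""
    return bool(RETWEET_RE.match(text))

# Hypothetical tweets, not taken from the data set
samples = [
    "RT @ESA_Rosetta: Touchdown! #CometLanding",
    "RTL news is covering the #CometLanding live",
    "So proud of the Philae team #CometLanding",
]

flags = [is_retweet(text) for text in samples]
loose_flags = [text.startswith("RT") for text in samples]
print(flags)        # stricter check
print(loose_flags)  # heuristic used in the notebook
```

The second sample is flagged by the loose check but not by the stricter one. In pandas, a comparable filter could be written with `df.text.str.match(r'^RT @\w+', na=False)`.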


Replies to earlier tweets are counted below; there are 1723 of them, equivalent to roughly one tenth of the 17269 tweets that are not retweets.

In [72]:
# A tweet is a reply when its in_reply_to_screen_name field is set
dfReplies = df
numReplies = int(dfReplies['in_reply_to_screen_name'].notnull().sum())

print(numReplies)

1723


In [73]:
print("Total number of replies: " + str(numReplies))

Total number of replies: 1723


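Hashtags are an important way of labelling tweets by topic, so the hashtags used in the non-retweet tweets are extracted and counted below.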

In [17]:
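hashtags = []
for index, row in dfNoRT.iterrows():
    text = (row['text'].split(" "))
    for token in text:
        # Tokens are kept verbatim, so case differences and trailing punctuation
        # (e.g. '#cometlanding' and '#CometLanding:') are counted separately
        if token.startswith('#'):
            hashtags.append(str(token))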



In [18]:
from collections import Counter

# Count how often each hashtag occurs
hashtagCount = Counter(hashtags)

# Print only the hashtags that were used more than 150 times
for key,val in hashtagCount.items():
    if val>150:
        print (repr(key) + "=>" + repr(val))
    


'#cometlanding'=>1834
'#CometLanding'=>12741
'#ESA'=>194
'#Rosetta'=>1471
'#Philae'=>734
'#CometLanding:'=>161
'#rosettamission'=>169
'#Cometlanding'=>165
'#67P'=>400
'#CometLanding.'=>244
'#rosetta'=>178
'#WishKoSaPasko'=>929
'#HappyBirthdaySandaraPark'=>928
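The output above counts `#CometLanding`, `#cometlanding`, `#CometLanding:` and `#CometLanding.` as different hashtags. A minimal sketch of one way to merge them follows, assuming a hashtag is a `#` followed by word characters and that case should be ignored. The helper `extract_hashtags` and the sample tweets are hypothetical.

```python
import re
from collections import Counter

# Assumed definition: '#' followed by one or more word characters
HASHTAG_RE = re.compile(r"#(\w+)")

def extract_hashtags(text):
    """Return lower-cased hashtags without '#' or trailing punctuation."""
    return [tag.lower() for tag in HASHTAG_RE.findall(text)]

# Hypothetical tweets, not taken from the data set
sample_tweets = [
    "Touchdown! #CometLanding: #Philae is on #67P",
    "Watching the #cometlanding live. #Rosetta",
    "What a day for #ESA and #rosetta #CometLanding.",
]

normalised_counts = Counter(
    tag for tweet in sample_tweets for tag in extract_hashtags(tweet)
)
print(normalised_counts)
```

With this normalisation the three spellings of the landing hashtag in the samples collapse into a single `cometlanding` entry.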


In [19]:
# Strip the leading '#' so the word cloud is built from the bare tag text
words = [e[1:] for e in hashtags]
stopwords = set(STOPWORDS)
stopwords.add("CometLanding")


# Pass the tags as one space-separated string rather than the repr of a list
wordcloud = WordCloud(background_color='white',stopwords=stopwords,max_words=30000,max_font_size=40, random_state=42).generate(" ".join(words))
plt.figure(figsize=(10,10))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

NameError: name 'STOPWORDS' is not defined

In [74]:
data = [go.Bar(x=['Tweets', 'Retweets', 'Replies'],y=[numTweets,numReTweets,numReplies])]
layout = go.Layout(
    title='Number of Retweets, Replies and Tweets',
    yaxis=dict(
        title='Usage',
        titlefont=dict(
            family='Courier New, monospace',
            size=18,
            color='#7f7f7f'
        )
    )
)

fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='basic-bar')

The graph above shows the total number of unique tweets in the data set alongside how many of them are retweets and how many are replies. Retweets clearly dominate: there are 59999 of them, against only 1723 replies.

One of the main reasons for this difference is what each action is for. Retweeting shares a particular tweet with your followers, typically because it is interesting or because you agree with it. A reply serves the more conversational side of Twitter: it adds a response or further information to a tweet, and in some cases carries a conversation between people.

Another possible reason is that retweets are much easier to make than replies. Retweeting takes a single click, whereas a reply needs more thought because its content has to be composed and written out. Retweets may therefore be more prominent simply because they are quicker and easier to use.
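To put the comparison in proportion, the totals reported earlier in the notebook can be turned into shares of the data set. This is only a worked calculation on those printed figures; the variable names are chosen for illustration.

```python
# Totals printed earlier in the notebook
num_tweets = 77268
num_retweets = 59999
num_replies = 1723

# Shares of the whole data set, as percentages
retweet_share = 100 * num_retweets / num_tweets
reply_share = 100 * num_replies / num_tweets

# How many retweets there are for every reply
retweets_per_reply = num_retweets / num_replies

print(f"Retweets: {retweet_share:.1f}% of tweets")
print(f"Replies: {reply_share:.1f}% of tweets")
print(f"Retweets per reply: {retweets_per_reply:.1f}")
```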

In [21]:
dfSource = df
import re
from collections import Counter

# The 'source' column holds an HTML anchor, e.g. <a href="..." rel="nofollow">App name</a>.
# Keep only the application name between the tags, once per tweet.
items = []
for text in dfSource['source'].dropna():
    match = re.search(r'>([^<]*)</a>', str(text))
    if match:
        items.append(match.group(1))

# Count how many tweets were sent from each application
appCount = Counter(items)

In [22]:
topItems = sorted(appCount.items(), key=operator.itemgetter(1),reverse=True)

In [23]:
#Gets the top 4 applications
topApplication = topItems[0][0]
topApplicationNum = topItems[0][1]
    
secondApplication = topItems[1][0]
secondApplicationNum = topItems[1][1]
    
thirdApplication = topItems[2][0]
thirdApplicationNum = topItems[2][1]

fourthApplication = topItems[3][0]
fourthApplicationNum = topItems[3][1]

totalApplications = sum(appCount.values())
otherApplicationNum = (totalApplications-(topApplicationNum+secondApplicationNum+thirdApplicationNum+fourthApplicationNum))


In [24]:
labels = [topApplication,secondApplication,thirdApplication,fourthApplication,"Other"]
values = [topApplicationNum,secondApplicationNum,thirdApplicationNum,fourthApplicationNum,otherApplicationNum]

trace = go.Pie(labels=labels, values=values)

py.iplot([trace], filename='basic_pie_chart')
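The "top four plus other" grouping used for the pie chart can be written more compactly with `collections.Counter`. The sketch below uses a hypothetical helper, `top_n_with_other`, and made-up application counts; it is not the code that produced the chart above.

```python
from collections import Counter

def top_n_with_other(counts, n=4):
    """Return labels and values for the n largest entries plus an 'Other' bucket."""
    top = counts.most_common(n)
    other = sum(counts.values()) - sum(count for _, count in top)
    labels = [name for name, _ in top] + ["Other"]
    values = [count for _, count in top] + [other]
    return labels, values

# Hypothetical application counts, not taken from the data set
app_counts = Counter(
    {"App A": 50, "App B": 30, "App C": 10, "App D": 5, "App E": 3, "App F": 2}
)

labels, values = top_n_with_other(app_counts)
print(labels)
print(values)
```

The resulting `labels` and `values` lists could then be passed to `go.Pie` in the same way as in the cell above.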

In [34]:
dfSource = df
# Keep the first 10 characters of created_at, e.g. "Wed Nov 12" (weekday, month, day)
dates = dfSource['created_at'].str[0:10].tolist()

len(dates)


77267

In [35]:
from collections import Counter

# Count the number of tweets posted on each day
dateCount = Counter(dates)

In [53]:
values =[]
dateKeys = []

for key,val in dateCount.items():
    values.append(val)
    dateKeys.append(str(key))
    print("Date: "+ str(key) + " Number of Tweets: " + str(val))
    


Date: Fri Dec 05 Number of Tweets: 87
Date: Thu Dec 04 Number of Tweets: 200
Date: Wed Dec 03 Number of Tweets: 311
Date: Tue Dec 02 Number of Tweets: 475
Date: Mon Dec 01 Number of Tweets: 603
Date: Sun Nov 30 Number of Tweets: 343
Date: Sat Nov 29 Number of Tweets: 428
Date: Fri Nov 28 Number of Tweets: 711
Date: Thu Nov 27 Number of Tweets: 497
Date: Wed Nov 26 Number of Tweets: 400
Date: Wed Nov 12 Number of Tweets: 73212
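The daily counts above are keyed by strings such as "Wed Nov 12", so they appear in the order they occur in the data rather than in date order. A minimal sketch of parsing the timestamps into real dates and sorting them chronologically follows. It assumes `created_at` uses a format like "Wed Nov 12 16:03:00 +0000 2014" and English day and month names; the sample timestamps and the helper `tweets_per_day` are hypothetical.

```python
from collections import Counter
from datetime import datetime

# Assumed format of the created_at column
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def tweets_per_day(timestamps):
    """Return (date, count) pairs sorted chronologically."""
    days = Counter(
        datetime.strptime(ts, CREATED_AT_FORMAT).date() for ts in timestamps
    )
    return sorted(days.items())

# Hypothetical timestamps, not taken from the data set
samples = [
    "Fri Dec 05 09:15:00 +0000 2014",
    "Wed Nov 12 16:03:00 +0000 2014",
    "Wed Nov 12 16:05:30 +0000 2014",
    "Wed Nov 26 11:00:00 +0000 2014",
]

daily = tweets_per_day(samples)
for day, count in daily:
    print(day.isoformat(), count)
```

Sorting by a real date also makes gaps in the collection period, such as the one between 12 and 26 November in the output above, easier to spot on a chart.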


In [54]:
data = [go.Bar(x=dateKeys,y=values)]
layout = go.Layout(
    title='Number of Tweets per day',
    yaxis=dict(
        title='Tweets',
        titlefont=dict(
            family='Courier New, monospace',
            size=18,
            color='#7f7f7f'
        )
    )
)

fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='basic-bar')