# Project information

**Project title**: Analyzing Social Media Data in Python

**Name:** Samia Khalid

**Email address associated with your DataCamp account:** samk.3211@gmail.com 

- Fear of missing out, curiosity, self-esteem, speed (patience - what's that?!), convenience: these seem to have become basic human needs in today's digital era, which is dominated by social media. And the social media platform that rules them all when it comes to satisfying these needs is the mighty Twitter. Twitter has a disproportionate influence on the world because of the nature of its audience -- it's a hub for politicians, celebrities and influencers. This makes it **a great source for understanding what people feel strongly about at any given moment**. So let's put our Data Scientist hats on and learn to analyze Twitter's **rich open dataset**.
 
 
- For the purpose of this project we will be using pre-downloaded datasets which you can find in the 'datasets' folder of this project.
- Note: this project assumes that you are familiar with Python and basic data visualization techniques, such as plotting histograms with matplotlib.

# Project introduction

## 1. What's the hype?

Tweets are posted at lightning speed and are available for consumption in near real time. This is a treat for people who work with data, like us, because **we can easily learn about dominant thought patterns and moods around the globe**!
<br>
To collect this wealth of information, we need to go through some authentication steps and call the Twitter APIs. The scope of this project is to extract insights from Twitter data, so we are going to skip the data collection process. I have already done that part for you and stored the results in JSON files, so we can concentrate on the fun part! So ***roll up your sleeves and let the fun begin!***

Let's start by reading the data for topics that were trending worldwide and in the United States when this project was created (a snapshot of the response from Twitter's GET trends/place API call).

***Note***: If you want to retrieve the latest trends, or just want to experiment with the Twitter API yourself, here is the link to the documentation: https://developer.twitter.com/en/docs/trends/trends-for-location/api-reference/get-trends-place.html

![title](img/twitter2.png)

In [1]:
import json

# Read the worldwide and US trends datasets into WW_trends and US_trends
with open('datasets/twitter_WWTrends.json', encoding='utf-8') as f:
    WW_trends = json.load(f)
with open('datasets/twitter_USTrends.json', encoding='utf-8') as f:
    US_trends = json.load(f)

# Print out WW and US trends
print(WW_trends)
print()
print(US_trends)

[{'trends': [{'name': '#BeratKandili', 'url': 'http://twitter.com/search?q=%23BeratKandili', 'promoted_content': None, 'query': '%23BeratKandili', 'tweet_volume': 46373}, {'name': '#GoodFriday', 'url': 'http://twitter.com/search?q=%23GoodFriday', 'promoted_content': None, 'query': '%23GoodFriday', 'tweet_volume': 81891}, {'name': '#WeLoveTheEarth', 'url': 'http://twitter.com/search?q=%23WeLoveTheEarth', 'promoted_content': None, 'query': '%23WeLoveTheEarth', 'tweet_volume': 159698}, {'name': '#195TLdenTTVerilir', 'url': 'http://twitter.com/search?q=%23195TLdenTTVerilir', 'promoted_content': None, 'query': '%23195TLdenTTVerilir', 'tweet_volume': None}, {'name': '#AFLNorthDons', 'url': 'http://twitter.com/search?q=%23AFLNorthDons', 'promoted_content': None, 'query': '%23AFLNorthDons', 'tweet_volume': None}, {'name': 'Shiv Sena', 'url': 'http://twitter.com/search?q=%22Shiv+Sena%22', 'promoted_content': None, 'query': '%22Shiv+Sena%22', 'tweet_volume': None}, {'name': 'Lyra McKee', 'url': 

## 2. Let's prettify our output

The above output is not so easy to read, is it? Let's use the built-in json package to force a nicer display.

We can use **json.dumps()** to format our output as a JSON string. If we also pass the **'indent'** parameter (a non-negative integer), JSON array elements and object members are pretty-printed with that indent level. An indent level of 0 inserts only newlines, and if the indent level is omitted, the most compact representation is used by default.
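
To make the effect of `indent` concrete, here is a small sketch on a made-up record (the `sample` dictionary is illustrative and not part of the project's datasets):

```python
import json

# Hypothetical record, used only to illustrate the indent parameter
sample = {"name": "#Example", "tweet_volume": None}

compact = json.dumps(sample)             # indent omitted: everything on one line
newlines = json.dumps(sample, indent=0)  # indent=0: newlines, no leading spaces
pretty = json.dumps(sample, indent=1)    # indent=1: one space per nesting level

print(compact)
print(newlines)
print(pretty)
```

Note that Python's `None` is written as JSON `null` in all three cases.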

In [2]:
import json

print(json.dumps(WW_trends, indent=1))
print("------------------------------")
print(json.dumps(US_trends, indent=1))

[
 {
  "trends": [
   {
    "name": "#BeratKandili",
    "url": "http://twitter.com/search?q=%23BeratKandili",
    "promoted_content": null,
    "query": "%23BeratKandili",
    "tweet_volume": 46373
   },
   {
    "name": "#GoodFriday",
    "url": "http://twitter.com/search?q=%23GoodFriday",
    "promoted_content": null,
    "query": "%23GoodFriday",
    "tweet_volume": 81891
   },
   {
    "name": "#WeLoveTheEarth",
    "url": "http://twitter.com/search?q=%23WeLoveTheEarth",
    "promoted_content": null,
    "query": "%23WeLoveTheEarth",
    "tweet_volume": 159698
   },
   {
    "name": "#195TLdenTTVerilir",
    "url": "http://twitter.com/search?q=%23195TLdenTTVerilir",
    "promoted_content": null,
    "query": "%23195TLdenTTVerilir",
    "tweet_volume": null
   },
   {
    "name": "#AFLNorthDons",
    "url": "http://twitter.com/search?q=%23AFLNorthDons",
    "promoted_content": null,
    "query": "%23AFLNorthDons",
    "tweet_volume": null
   },
   {
    "name": "Shiv Sena",
    "ur

Thanks to json, our output is now much easier to analyze! As you can see, the output (a snapshot of the response from Twitter's GET trends/place API call) is an array of trend objects that encode the name of the trending topic, the query parameter that can be used to search for the topic on Twitter Search, and the Twitter Search URL for that trend. According to the online documentation, these results are updated every five minutes.<br><br>
For example, we can see that at the moment of the query, #BeratKandili, #GoodFriday and #WeLoveTheEarth were trending worldwide, and #WeLoveTheEarth was being talked about more than the other two topics. Its "tweet_volume" is the highest of the three even though it is listed third, which tells us that the trends are not sorted by volume.<br>
If we scroll down to the US trends, we can see that there are some trends that are unique to the US. However, since there is also some overlap in the trends, let's dig deeper into the trends that are influencing everyone.
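
Returning to the tweet volumes for a moment: since the response is not ordered by volume, we can sort it ourselves. The sketch below uses four trends copied from the worldwide output above; treating a missing volume (`None`) as 0 is an assumption made here so that those trends sort last.

```python
# A few trends copied from the worldwide output (name and tweet_volume only)
sample_trends = [
    {"name": "#BeratKandili", "tweet_volume": 46373},
    {"name": "#GoodFriday", "tweet_volume": 81891},
    {"name": "#WeLoveTheEarth", "tweet_volume": 159698},
    {"name": "#195TLdenTTVerilir", "tweet_volume": None},
]

# Assumption: a missing volume (None) counts as 0, so it ends up last
by_volume = sorted(
    sample_trends,
    key=lambda trend: trend["tweet_volume"] or 0,
    reverse=True,
)

for trend in by_volume:
    print(trend["name"], trend["tweet_volume"])
```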

## 3. USA vs the World: what do we all want to talk about?

Although it’s easy enough to skim the two sets of trends and look for commonality, we can use [Python’s set data structure](https://docs.python.org/2/library/sets.html) (unordered collections of unique elements) to automatically compute this for us.

Let's parse the names of the trending topics by iterating through the two trends objects, cast the lists of names to sets, and compute the intersection by using the built-in intersection method to get the common topics between the two sets. 

In [3]:
world_trends = set([trend['name']
                    for trend in WW_trends[0]['trends']])

us_trends = set([trend['name']
                 for trend in US_trends[0]['trends']])

common_trends = world_trends.intersection(us_trends)

print(world_trends)
print()
print(us_trends)
print()
print(common_trends)
print(len(common_trends))

{'#CHIvLIO', '池袋の事故', '#HayırlıCumalar', 'Priyanka Chaturvedi', '#BLACKPINKxCorden', '#Ontas', '#NRLBulldogsSouths', '刀ステ', '#DuyguAsena', '#HardikPatel', '#HanumanJayanti', 'Shiv Sena', '#DragRace', '#TheJudasInMyLife', '高齢者', '#NikahUmurBerapa', '브이알', '#DinahJane1', '十二国記', 'Derry', '#195TLdenTTVerilir', 'Derrick White', '#ViernesSanto', 'Lil Dicky', '#KpuJanganCurang', '#ShivSena', 'Hemant Karkare', '#يوم_الجمعه', '#ProtestoEdiyorum', 'Berat Kandilimiz', '#19aprile', '#HayırlıKandiller', '#Jersey', '#JunquerasACN', 'プリウス', '#اغلاق_BBM', '#GoodFriday', '東京・池袋衝突事故', '歩行者', '#Karfreitag', '#IndonesianElectionHeroes', 'グレア', '#ConCalmaRemix', '#AFLNorthDons', 'Lyra McKee', '重体の女性と女児', '#BeratKandili', '#WeLoveTheEarth', '免許返納', 'örgütdeğil arkadaşgrubu'}

{'George Conway', '#LilDicky', '#BLACKPINKxCorden', '#NRLBulldogsSouths', '#TimeToImpeach', '#GSWvsLAC', '#rupaulsdragrace', '#WeirdDateStories', 'Yvie', '#WhatStopsYouFromGoingHome', '#DragRace', '#TheLegendOfVoxMachina', '#Earth', '

So, out of the two sets of trends (each of size 50), we have 11 overlapping topics. And from the output of the intersection, there is one common trend that stands out for me - **#WeLoveTheEarth**. It's so good to see that a major common trend is a positive influence and that everyone is talking about loving Mother Earth! 

Note: the amount of overlap, if any, between two sets of trends is dependent on what’s actually happening in the Twitter world at that moment in time.
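
The same set type can also answer the opposite question, namely which trends are unique to one side. Here is a minimal sketch with hypothetical trend sets (the membership of each name is made up for illustration and does not reflect the real datasets):

```python
# Hypothetical trend names, for illustration only
world = {"#WeLoveTheEarth", "#GoodFriday", "Shiv Sena"}
us = {"#WeLoveTheEarth", "#GoodFriday", "#TimeToImpeach"}

common = world & us     # same as world.intersection(us)
only_us = us - world    # same as us.difference(world)
only_world = world - us

# Sets are unordered, so sort before printing for a stable display
print(sorted(common))
print(sorted(only_us))
print(sorted(only_world))
```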

![title](img/twitter3.jpg)

## 4. Exploring #WeLoveTheEarth


Since we are curious to learn all about #WeLoveTheEarth, I used it as the search query to fetch some tweets for further analysis and stored the response in 'datasets/WeLoveTheEarth.json'.

TODO: extend explanation

(If you want to play with the search API yourself, you can refer to this documentation: https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets.html)

In [4]:
with open('datasets/WeLoveTheEarth.json', encoding='utf-8') as f:
    tweets = json.load(f)

# Inspect the first two tweets
tweets[0:2]

[{'created_at': 'Fri Apr 19 08:46:48 +0000 2019',
  'id': 1119160405270523904,
  'id_str': '1119160405270523904',
  'text': 'RT @lildickytweets: 🌎 out now #WeLoveTheEarth https://t.co/L22XsoT5P1',
  'truncated': False,
  'entities': {'hashtags': [{'text': 'WeLoveTheEarth', 'indices': [30, 45]}],
   'symbols': [],
   'user_mentions': [{'screen_name': 'lildickytweets',
     'name': 'LD',
     'id': 1209516660,
     'id_str': '1209516660',
     'indices': [3, 18]}],
   'urls': [{'url': 'https://t.co/L22XsoT5P1',
     'expanded_url': 'https://youtu.be/pvuN_WvF1to',
     'display_url': 'youtu.be/pvuN_WvF1to',
     'indices': [46, 69]}]},
  'metadata': {'iso_language_code': 'en', 'result_type': 'recent'},
  'source': '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>',
  'in_reply_to_status_id': None,
  'in_reply_to_status_id_str': None,
  'in_reply_to_user_id': None,
  'in_reply_to_user_id_str': None,
  'in_reply_to_screen_name': None,
  'user': {'id': 212

As you can see, there’s a lot more to a tweet than meets the eye! We have far more data than just the short text (up to 280 characters) that’s normally thought of as a tweet.

#### TODO: explain output

## 5. Extracting data from tweets for #WeLoveTheEarth

In [5]:
# Text of every tweet
texts = [tweet['text'] for tweet in tweets]

# Screen names of all users mentioned in the tweets
names = [user_mention['screen_name']
         for tweet in tweets
         for user_mention in tweet['entities']['user_mentions']]

# Hashtags used in the tweets (without the leading '#')
hashtags = [hashtag['text']
            for tweet in tweets
            for hashtag in tweet['entities']['hashtags']]

# Compute a collection of all words from the texts of all the tweets
words = [w
         for t in texts
         for w in t.split()]

# Explore the first 5 items for each...
print(json.dumps(texts[0:5], indent=1))
print(json.dumps(names[0:5], indent=1))
print(json.dumps(hashtags[0:5], indent=1))
print(json.dumps(words[0:5], indent=1))

[
 "RT @lildickytweets: \ud83c\udf0e out now #WeLoveTheEarth https://t.co/L22XsoT5P1",
 "\ud83d\udc9a\ud83c\udf0e\ud83d\udc9a  #WeLoveTheEarth \ud83d\udc47\ud83c\udffc",
 "RT @cabeyoomoon: Ta piosenka to bop,  wpada w ucho  i dochody z niej id\u0105 na dobry cel,  warto s\u0142ucha\u0107 w k\u00f3\u0142ko i w k\u00f3\u0142ko gdziekolwiek si\u0119 ty\u2026",
 "#WeLoveTheEarth \nCzemu ja si\u0119 pop\u0142aka\u0142am",
 "RT @Spotify: This is epic. @lildickytweets got @justinbieber, @arianagrande, @halsey, @sanbenito, @edsheeran, @SnoopDogg, @ShawnMendes, @Kr\u2026"
]
[
 "lildickytweets",
 "cabeyoomoon",
 "Spotify",
 "lildickytweets",
 "justinbieber"
]
[
 "WeLoveTheEarth",
 "WeLoveTheEarth",
 "WeLoveTheEarth",
 "EARTH",
 "WeLoveTheEarth"
]
[
 "RT",
 "@lildickytweets:",
 "\ud83c\udf0e",
 "out",
 "now"
]


## 6. Creating a basic frequency distribution from the words in tweets

In [6]:
from collections import Counter

for item in [words, names, hashtags]:
    c = Counter(item)
    print(c.most_common(10))  # top 10
    print()

[('RT', 416), ('#WeLoveTheEarth', 298), ('the', 143), ('to', 128), ('a', 94), ('is', 84), ('and', 66), ('this', 65), ('@lildickytweets', 61), ('się', 47)]

[('lildickytweets', 102), ('LeoDiCaprio', 44), ('ShawnMendes', 33), ('halsey', 31), ('ArianaGrande', 30), ('justinbieber', 29), ('Spotify', 26), ('edsheeran', 26), ('sanbenito', 25), ('SnoopDogg', 25)]

[('WeLoveTheEarth', 313), ('4future', 12), ('19aprile', 12), ('EARTH', 11), ('fridaysforfuture', 10), ('EarthMusicVideo', 3), ('ConCalmaRemix', 3), ('Earth', 3), ('aliens', 2), ('AvengersEndgame', 2)]
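
One detail in the hashtag counts above is that `EARTH` and `Earth` are tallied separately, because `Counter` treats strings that differ in case as different keys. A minimal sketch of case-insensitive counting, on a made-up list (`sample_hashtags` is illustrative, not the project's data):

```python
from collections import Counter

# Made-up hashtag list that mixes spellings, for illustration only
sample_hashtags = ["WeLoveTheEarth", "EARTH", "Earth", "earth", "WeLoveTheEarth"]

# Default behaviour: every spelling is its own key
case_sensitive = Counter(sample_hashtags)

# Lower-case each tag first so that all spellings share one key
case_insensitive = Counter(tag.lower() for tag in sample_hashtags)

print(case_sensitive.most_common(2))
print(case_insensitive.most_common(2))
```

Whether merging spellings is desirable depends on the question being asked, so this is an option rather than a correction.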



## 7. Calculating lexical diversity for tweets

In [7]:
# A function for computing lexical diversity:
# the share of unique tokens among all tokens
def lexical_diversity(tokens):
    return len(set(tokens)) / len(tokens)

# A function for computing the average number of words per tweet
def average_words(tweets):
    total_words = sum(len(s.split()) for s in tweets)
    return total_words / len(tweets)

print(lexical_diversity(words))
print(lexical_diversity(names))
print(lexical_diversity(hashtags))
print(average_words(texts))

0.3041028781383956
0.24226110363391656
0.10565110565110565
15.64176245210728
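
The first three numbers are the lexical diversity of the words, screen names and hashtags, and the last one is the average number of words per tweet. To see how the ratio behaves, here is a tiny worked example on a made-up token list (the function is repeated so the snippet runs on its own):

```python
def lexical_diversity(tokens):
    """Share of tokens that are unique (1.0 means no token repeats)."""
    return len(set(tokens)) / len(tokens)

# Made-up token list: 4 tokens, 3 of them distinct
tokens = ["RT", "earth", "RT", "love"]

print(lexical_diversity(tokens))  # 3 distinct / 4 total = 0.75
```

A value close to 1 means that almost every token is different, while a value close to 0 means that a few tokens are repeated many times.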


## 8. Finding the most popular retweets

In [8]:
# Collect (tweet id, retweet count, original author, text) for every retweet
retweets = [
    (tweet['id'],
     tweet['retweet_count'],
     tweet['retweeted_status']['user']['screen_name'],
     tweet['text'])
    for tweet in tweets
    if 'retweeted_status' in tweet
]

from prettytable import PrettyTable

pt = PrettyTable(field_names=['Retweet Id', 'Count', 'Screen Name', 'Text'])
# Note: sorting the tuples directly orders them by their first field, the
# tweet id; pass key=lambda row: row[1] to rank by retweet count instead
for row in sorted(retweets, reverse=True)[:5]:
    pt.add_row(row)
pt.max_width['Text'] = 65
pt.align = 'l'
print(pt)

print()
print("OR, we can simply display retweets if we want to skip the prettytable")

+---------------------+-------+----------------+-------------------------------------------------------------------+
| Retweet Id          | Count | Screen Name    | Text                                                              |
+---------------------+-------+----------------+-------------------------------------------------------------------+
| 1119160405270523904 | 7482  | lildickytweets | RT @lildickytweets: 🌎 out now #WeLoveTheEarth                     |
|                     |       |                | https://t.co/L22XsoT5P1                                           |
| 1119160400602222595 | 9     | cabeyoomoon    | RT @cabeyoomoon: Ta piosenka to bop,  wpada w ucho  i dochody z   |
|                     |       |                | niej idą na dobry cel,  warto słuchać w kółko i w kółko           |
|                     |       |                | gdziekolwiek się ty…                                              |
| 1119160395439034369 | 4288  | Spotify        | RT @Spotify: Th
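
The rows in the table above are ordered by tweet id rather than by popularity, because Python compares tuples on their first element. A minimal sketch of ranking by retweet count instead, using made-up rows in the same shape as `retweets` (the ids, names and texts are hypothetical):

```python
# Made-up rows: (tweet id, retweet count, screen name, text)
sample_retweets = [
    (3, 7482, "user_a", "RT @user_a: first"),
    (2, 9, "user_b", "RT @user_b: second"),
    (1, 4288, "user_c", "RT @user_c: third"),
]

# Default tuple ordering: compares the tweet id first
by_id = sorted(sample_retweets, reverse=True)

# Explicit key: compare only the retweet count (index 1)
by_count = sorted(sample_retweets, key=lambda row: row[1], reverse=True)

print([row[1] for row in by_id])     # counts in tweet id order
print([row[1] for row in by_count])  # counts from highest to lowest
```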

## 9. Looking for more interesting patterns in the data

In [9]:
# Language code of each tweet, as reported by Twitter
tweets_lang = [tweet['lang'] for tweet in tweets]

In [10]:
import matplotlib.pyplot as plt

plt.hist(tweets_lang)
plt.show()

<Figure size 640x480 with 1 Axes>

In [11]:
# Client application ('source') each tweet was posted from
tweets_src = [tweet['source'] for tweet in tweets]

tweets_src

['<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>',
 '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
 '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>',
 '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>',
 '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
 '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>',
 '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>',
 '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>',
 '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>',
 '<a href="https://mobile.twitter.com" rel="nofollow">Twitter Web App</a>',
 '<a href="https://mobile.twitter.com" rel="nofollow">Twitter Web App</a>',
 '<a href="http://twitter.com/download/android" rel="nofollow">Twit

In [12]:
import re

# Extract the client name from between the <a ...> and </a> tags
tweets_source_new = []
for src in tweets_src:
    match = re.search(r'>(.+?)<', src)
    # Fall back to the raw value if it is not wrapped in an HTML tag
    tweets_source_new.append(match.group(1) if match else src)
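
To see what the regular expression does, here it is applied to a single `source` value copied from the output above, and to a plain string without any tag (the plain string is a made-up input that shows the fallback):

```python
import re

# One of the `source` values shown in the earlier output
source = '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>'

# Non-greedy match for the text between the first '>' and the next '<'
match = re.search(r'>(.+?)<', source)
client = match.group(1) if match else source
print(client)

# Made-up input without tags: re.search returns None, so we keep the raw value
plain = 'Some Client'
match_plain = re.search(r'>(.+?)<', plain)
client_plain = match_plain.group(1) if match_plain else plain
print(client_plain)
```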

In [13]:
# Count how many tweets came from each client
# (Counter was imported from collections in an earlier cell)
c = Counter(tweets_source_new)
c

Counter({'Twitter for Android': 232,
         'Twitter for iPhone': 246,
         'Twitter Web Client': 19,
         'Twitter Web App': 20,
         'Twitter for iPad': 2,
         'Falcon Social Media Management ': 1,
         'Bot Libre!': 1,
         'IFTTT': 1})

## 10. GeoLocation and Interactive Maps 
#### Explaining bounding boxes, extracting coordinates, processing coordinates and then visualizing info using Basemap - the project might get too long.. (I think it's already too long compared to the example project:d) Need some feedback here. Thanks!