# <p><font color=deepskyblue>&#9658; Data Mining : Collect Twitter Data into MongoDB &#9664;</font></p>


___
# <p><font color=teal> &#129414; Instructions </font></p>

**We want to extract tweets from Twitter based on one or several specific hashtags, and then analyse them.**

To achieve this goal we will go through these steps:

- Install **MongoDB**
- Get your personal credentials from the **Twitter API**
- Select some **hashtags** related to an event or movement (/!\ we will use the tweepy stream, which only delivers tweets as they are posted, so the event has to be trending while you collect)
- Use the Stream API to **collect tweets** to create our dataset (I collected 5000 tweets in another project and I would consider 1000 a minimum, so it works best with a very trendy topic such as a political election or an international sports event)
- **Analyse** the collected tweets


#### Useful sources:
* [**Mining the Social Web, 2nd Edition by Matthew A. Russell**](https://www.webpages.uidaho.edu/~stevel/504/Mining-the-Social-Web-2nd-Edition.pdf) (really cool and detailed book for mining Twitter, Facebook, Linkedin, Google+, web pages, etc!)
* [Twitter documentation](https://developer.twitter.com/en/docs/tweets/filter-realtime/api-reference/post-statuses-filter): it describes the API limits that your requests have to stay within
* Composition of a [tweet object](https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object)

#### Frequent questions:
* **Does it work only for hashtags ?** <br>
* **Is it legal ?** <br> Yes, because tweets are public; it's like doing a RT. The API doesn't give you access to users' personal data. But make sure you respect the GDPR (RGPD in French) if you do more operations.
* **What is the limit of tweets that you can collect ?** <br>
* **What kind of data do I have access to?** <br>

___
# <p><font color=teal> &#129414; Set Up </font></p>

* Install [pymongo](https://pymongo.readthedocs.io/en/stable/installation.html) ([installation youtube video](https://youtu.be/FwMwO8pXfq0))
* Check that you have MongoDB Compass
* Install [tweepy](http://docs.tweepy.org/en/latest/install.html)
* [Create a Twitter application](https://developer.twitter.com/en/application/intent) on [Twitter API](https://apps.twitter.com/) and get your access tokens / credentials <br>
(TUTORIAL FOR THIS STEP: [Authenticate a Python Application with Twitter using Tweepy](https://www.digitalocean.com/community/tutorials/how-to-authenticate-a-python-application-with-twitter-using-tweepy-on-ubuntu-14-04))




## <p><font color=lightsteelblue> &#129370; PART 1: Collect the tweets </font></p>
### a) Import libraries

In [1]:
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream
import json
import pymongo

# mongo connection
from pymongo import MongoClient

### b) Load your personal secret keys

You need to create a `.keys.json` file in the root of your folder with your personal access tokens that you got on the [Twitter API](https://apps.twitter.com/) as explained in Set Up:

`[{
    "consumer_key": "",
    "consumer_secret": "",
    "access_token": "",
    "access_token_secret": ""
}]`

In [2]:
# Open and load the credentials stored in the file .keys.json
''' Assumes a file ".keys.json" where all credentials are stored as:
[{
    "consumer_key": "",
    "consumer_secret": "",
    "access_token": "",
    "access_token_secret": ""
}]
'''

with open('.keys.json', 'r') as myFile:
    data = json.load(myFile)

consumer_key = data[0]['consumer_key']
consumer_secret = data[0]['consumer_secret']
access_token = data[0]['access_token']
access_token_secret = data[0]['access_token_secret']
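
Before opening the stream, it can help to fail early when the keys file is incomplete. The following is a minimal sketch, assuming the one-element list layout shown above; the helper name `load_twitter_keys` and the throwaway demo file are illustrative and not part of the original notebook.

```python
import json
import os
import tempfile

REQUIRED_KEYS = ("consumer_key", "consumer_secret",
                 "access_token", "access_token_secret")

def load_twitter_keys(path):
    """Return the first credentials dict stored in `path`.

    Raises ValueError if one of the four expected entries is missing or empty.
    (Hypothetical helper, written for this illustration only.)
    """
    with open(path, "r", encoding="utf-8") as key_file:
        entries = json.load(key_file)
    creds = entries[0]  # the file holds a one-element list
    missing = [k for k in REQUIRED_KEYS if not creds.get(k)]
    if missing:
        raise ValueError("missing credentials: " + ", ".join(missing))
    return creds

# Demo with a throwaway file and dummy values (not real tokens)
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, ".keys.json")
with open(demo_path, "w", encoding="utf-8") as demo_file:
    json.dump([{k: "dummy" for k in REQUIRED_KEYS}], demo_file)

demo_creds = load_twitter_keys(demo_path)
print(sorted(demo_creds))
```

With the template from this section left unfilled (all values `""`), the helper raises a `ValueError` that names the empty entries instead of letting the authentication fail later.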

### c) Fetch the stream of tweets

We will now use **MongoDB** with tweepy ! What is MongoDB ? <br>
MongoDB is a **NoSQL database**: the data is stored in documents rather than in tables. This flexibility is useful for projects that need to be developed rapidly, before you have a full overview of the data, because no strict schema is required. <br>
The **documents** are saved in a **JSON-like format** (BSON, a binary representation of JSON).<br>
**Databases** hold one or more collections of documents.<br>
MongoDB stores documents in **collections**. Collections are analogous to tables in relational databases.<br>
(Learn more in [Mongo DB documentation](https://docs.mongodb.com/manual/core/databases-and-collections/))

<img src="MongoDBdbcol.png" align="left" style="width: 500px;"/>
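
To make the database / collection / document hierarchy concrete without a running server, the sketch below models it with plain Python dictionaries and lists. It is only an analogy: the collection name `tweets_demo` and the sample documents are made up, and no MongoDB call is involved.

```python
import json

# Plain-Python analogy: database -> collections -> documents
database = {
    "tweets_demo": [                      # a collection (hypothetical name)
        {"text": "first tweet", "lang": "fr"},
        {"text": "second tweet", "lang": "en", "retweet_count": 3},
    ]
}

# Documents of the same collection do not need to share the same fields
field_sets = [sorted(doc) for doc in database["tweets_demo"]]
print(field_sets)

# Each document maps directly to a JSON object
as_json = json.dumps(database["tweets_demo"][1], sort_keys=True)
print(as_json)
```

The second document carries an extra `retweet_count` field, which a relational table would only allow after changing the schema for every row.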

In [3]:
client = MongoClient('localhost', 27017)
db = client['analysis'] # your database name // example: client['YOUR_DB_NAME']
collection = db['loisecuriteglobale3'] # your collection name // example: db['YOUR_COLLECTION_NAME']
count_collectedtweets = 0 # keeps track of how many tweets you have collected

class StdOutListener(StreamListener):
    """ A listener handles tweets that are received from the stream.
    This listener stores each received tweet in the MongoDB collection
    and prints a running count to stdout.
    """

    def on_data(self, data):
        print(type(data))
        collection.insert_one(json.loads(data)) # add the collected tweet into the collection
        global count_collectedtweets 
        count_collectedtweets = count_collectedtweets+1
        print(count_collectedtweets) # print the number of the collected tweet while collecting
        # Fix a limit of tweets to collect (I would say that at least 1000 would be nice)
        if count_collectedtweets < 10:
            return True #collect
        else:
            return False #stop

    def on_error(self, status):
        print(status)

if __name__ == '__main__':
    l = StdOutListener()
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)

    stream = Stream(auth, l)
    print('START COLLECTING TWEETS')
    stream.filter(track=['loisecuriteglobale', 'violencepoliciere','racisme'])
    print('--- STOP THE SCRIPT ---')
    #example: stream.filter(track=['GJ', 'giletsjaunes', 'greve','macron','retraites']) #results in french
    #example: stream.filter(track=['covid19', 'covid', 'corona','confinement','masques']) #results in various languages

START COLLECTING TWEETS
<class 'str'>
1
<class 'str'>
2
<class 'str'>
3
<class 'str'>
4
<class 'str'>
5
<class 'str'>
6
<class 'str'>
7
<class 'str'>
8
<class 'str'>
9
<class 'str'>
10
--- STOP THE SCRIPT ---
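
The stop-after-N logic of `on_data` can be tried without network access or a database. The sketch below is an offline illustration under stated assumptions: `TweetCollector` is a hypothetical class (not part of tweepy), a plain list stands in for the MongoDB collection, and the counter lives on the instance instead of in a global variable.

```python
import json

class TweetCollector:
    """Parses raw JSON tweets, stores them, and signals when to stop.

    Hypothetical stand-in for the listener above, for offline testing.
    """

    def __init__(self, sink, limit):
        self.sink = sink      # any object with .append(); stands in for the collection
        self.limit = limit    # number of tweets to collect
        self.count = 0

    def handle(self, raw):
        """Store one raw tweet; return True to keep streaming, False to stop."""
        self.sink.append(json.loads(raw))
        self.count += 1
        return self.count < self.limit

# Simulated stream of ten raw JSON strings
fake_stream = ['{"text": "tweet %d"}' % i for i in range(10)]

collector = TweetCollector(sink=[], limit=3)
for raw in fake_stream:
    if not collector.handle(raw):
        break

print(collector.count, len(collector.sink))
```

As in the listener above, the tweet that reaches the limit is still stored before the stop signal is returned.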


## <p><font color=lightsteelblue> &#128035; PART 2: Create a database </font></p>

### a) Check and export the collection to a .json file

If you use MongoDB Compass, connect locally, then export the collection as a .json file and save it in your root folder

Connection string to connect locally: **mongodb://localhost:27017/admin?readPreference=primary&appname=MongoDB%20Compass&ssl=false**


<img src="MongoDBcompassLocalhost.png" style="width: 500px;"/>
<img src="exportcollectioninjson.png" style="width: 500px;"/>
<img src="exportcollectioninjson2.png" style="width: 500px;"/><br>

Otherwise you can run the `mongoexport` tool from a terminal:

`mongoexport --db YOUR_DB_NAME --collection YOUR_COLLECTION_NAME --out YOUR_COLLECTION_NAME.json`

example:`mongoexport --db analysis --collection loisecuriteglobale3 --out loisecuriteglobale3.json`

If you already have a collection, you can import it into MongoDB with:

`mongoimport -d YOUR_DB_NAME -c YOUR_COLLECTION_NAME --file YOUR_COLLECTION_NAME.json`

### b) Display your file

Once you have saved the .json file in your root folder, you can check it quickly with this:

In [4]:
import pandas as pd
data = pd.read_json('loisecuriteglobale3.json', lines=True)
data.head(3)

Unnamed: 0,_id,contributors,coordinates,created_at,display_text_range,entities,extended_entities,extended_tweet,favorite_count,favorited,...,quoted_status_permalink,reply_count,retweet_count,retweeted,retweeted_status,source,text,timestamp_ms,truncated,user
0,{'$oid': '5fd8fb03c807ac4dda964830'},,,2020-12-15 18:05:49,"[{'$numberInt': '38'}, {'$numberInt': '140'}]","{'hashtags': [], 'urls': [{'url': 'https://t.c...",,{'full_text': '@siironique @Raheem0971 @twm1s ...,{'$numberInt': '0'},False,...,,{'$numberInt': '0'},{'$numberInt': '0'},False,,"<a href=""https://mobile.twitter.com"" rel=""nofo...","@siironique @Raheem0971 @twm1s @lzylv ""il faut...",2020-12-15 18:05:49.571,True,"{'id': {'$numberLong': '1281669801249775617'},..."
1,{'$oid': '5fd8fb07c807ac4dda964831'},,,2020-12-15 18:05:53,,"{'hashtags': [], 'urls': [], 'user_mentions': ...",,,{'$numberInt': '0'},False,...,"{'url': 'https://t.co/UvLXEm0pTl', 'expanded':...",{'$numberInt': '0'},{'$numberInt': '0'},False,{'created_at': 'Tue Dec 15 17:57:05 +0000 2020...,"<a href=""https://mobile.twitter.com"" rel=""nofo...",RT @Nantes_Revoltee: Nantes ce soir : moment h...,2020-12-15 18:05:53.562,False,"{'id': {'$numberInt': '271439132'}, 'id_str': ..."
2,{'$oid': '5fd8fb09c807ac4dda964832'},,,2020-12-15 18:05:55,,"{'hashtags': [], 'urls': [], 'user_mentions': ...",,,{'$numberInt': '0'},False,...,,{'$numberInt': '0'},{'$numberInt': '0'},False,{'created_at': 'Tue Dec 15 15:45:25 +0000 2020...,"<a href=""http://twitter.com/download/android"" ...",RT @sebastianjroche: Devant la dénégation du...,2020-12-15 18:05:55.318,False,"{'id': {'$numberLong': '717476046900166657'}, ..."
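
The preview above shows numbers wrapped as `{'$numberInt': '0'}` or `{'$numberLong': '...'}`. These wrappers appear to come from the Extended JSON notation used by the export, which keeps the original MongoDB types. If plain Python values are easier to work with, a small recursive helper can unwrap them. This is a minimal sketch: the function name `unwrap_extended_json` is hypothetical and only the wrappers visible in this notebook (plus `$numberDouble`) are handled.

```python
def unwrap_extended_json(value):
    """Recursively replace {'$numberInt': '3'}-style wrappers by plain values."""
    if isinstance(value, dict):
        if len(value) == 1:
            key, inner = next(iter(value.items()))
            if key in ("$numberInt", "$numberLong"):
                return int(inner)
            if key == "$numberDouble":
                return float(inner)
            if key == "$oid":
                return inner  # keep the object id as a plain string
        return {k: unwrap_extended_json(v) for k, v in value.items()}
    if isinstance(value, list):
        return [unwrap_extended_json(v) for v in value]
    return value

# A row shaped like the first record of the preview
row = {
    "_id": {"$oid": "5fd8fb03c807ac4dda964830"},
    "favorite_count": {"$numberInt": "0"},
    "display_text_range": [{"$numberInt": "38"}, {"$numberInt": "140"}],
    "user": {"id": {"$numberLong": "1281669801249775617"}},
}
clean = unwrap_extended_json(row)
print(clean)
```

The helper can be applied to each parsed line before building a DataFrame, so that counts such as `favorite_count` become integers that can be sorted and summed.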


### c) Format dataframe in a nice way
(I reused the code of [hectoramirez](https://github.com/hectoramirez/Language-localization_FIFA) for this part, feel free to modify it to obtain a different formatting)

In [5]:
import glob

# list all files containing tweets
files = glob.glob('loisecuriteglobale3.json')

tweets_data = []
for file in files:
    # Read in tweets (one JSON document per line) and store in list: tweets_data
    with open(file, "r", encoding='utf-8') as tweets_file:
        for line in tweets_file:
            tweet = json.loads(line)
            tweets_data.append(tweet)

In [6]:
print('There are', len(tweets_data), 'tweets in the dataset.')

There are 100 tweets in the dataset.


In [7]:
def flatten_tweets(tweets):
    """ Flattens out tweet dictionaries so relevant JSON is 
        in a top-level dictionary. """
    
    tweets_list = []
    
    # Iterate through each tweet
    for tweet_obj in tweets:
    
        ''' User info'''
        # Store the user screen name in 'user-screen_name'
        tweet_obj['user-screen_name'] = tweet_obj['user']['screen_name']
        
        # Store the user location
        tweet_obj['user-location'] = tweet_obj['user']['location']
    
        ''' Text info'''
        # Check if this is a 140+ character tweet
        if 'extended_tweet' in tweet_obj:
            # Store the extended tweet text in 'extended_tweet-full_text'
            tweet_obj['extended_tweet-full_text'] = \
                                    tweet_obj['extended_tweet']['full_text']
    
        if 'retweeted_status' in tweet_obj:
            # Store the retweet user screen name in 
            # 'retweeted_status-user-screen_name'
            tweet_obj['retweeted_status-user-screen_name'] = \
                        tweet_obj['retweeted_status']['user']['screen_name']

            # Store the retweet text in 'retweeted_status-text'
            tweet_obj['retweeted_status-text'] = \
                                        tweet_obj['retweeted_status']['text']
    
            if 'extended_tweet' in tweet_obj['retweeted_status']:
                # Store the extended retweet text in 
                #'retweeted_status-extended_tweet-full_text'
                tweet_obj['retweeted_status-extended_tweet-full_text'] = \
                tweet_obj['retweeted_status']['extended_tweet']['full_text']
                
        if 'quoted_status' in tweet_obj:
            # Store the quoted user screen name in
            # 'quoted_status-user-screen_name'
            tweet_obj['quoted_status-user-screen_name'] = \
                            tweet_obj['quoted_status']['user']['screen_name']

            # Store the quoted tweet text in 'quoted_status-text'
            tweet_obj['quoted_status-text'] = \
                                            tweet_obj['quoted_status']['text']

            if 'extended_tweet' in tweet_obj['quoted_status']:
                # Store the extended quoted text in
                # 'quoted_status-extended_tweet-full_text'
                tweet_obj['quoted_status-extended_tweet-full_text'] = \
                    tweet_obj['quoted_status']['extended_tweet']['full_text']
        
        ''' Place info'''
        if 'place' in tweet_obj:
            # Store the country code in 'place-country_code'
            try:
                tweet_obj['place-country'] = \
                                            tweet_obj['place']['country']
                
                tweet_obj['place-country_code'] = \
                                            tweet_obj['place']['country_code']
                
                tweet_obj['location-coordinates'] = \
                            tweet_obj['place']['bounding_box']['coordinates']
            except (KeyError, TypeError):
                # 'place' is null or incomplete for most tweets
                pass
        
        tweets_list.append(tweet_obj)
        
    return tweets_list

In [8]:
def select_text(tweets):
    ''' Assigns the main text to only one column depending
        on whether the tweet is a RT/quote or not'''
    
    tweets_list = []
    
    # Iterate through each tweet
    for tweet_obj in tweets:
        
        if 'retweeted_status-extended_tweet-full_text' in tweet_obj:
            tweet_obj['text'] = \
                        tweet_obj['retweeted_status-extended_tweet-full_text']
        
        elif 'retweeted_status-text' in tweet_obj:
            tweet_obj['text'] = tweet_obj['retweeted_status-text']
            
        elif 'extended_tweet-full_text' in tweet_obj:
            tweet_obj['text'] = tweet_obj['extended_tweet-full_text']
                
        tweets_list.append(tweet_obj)
        
    return tweets_list
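
The priority applied by `select_text` can also be expressed directly on a raw tweet dictionary, which makes it easy to check on small hand-made examples. The following is a minimal sketch, assuming the same field names as above; `best_text` is a hypothetical helper and the three sample tweets are invented for the illustration.

```python
def best_text(tweet):
    """Return the fullest available text of a raw tweet dictionary.

    Priority: full text of the retweeted tweet, short text of the
    retweeted tweet, full text of the tweet itself, then its short text.
    """
    retweet = tweet.get("retweeted_status") or {}
    candidates = [
        (retweet.get("extended_tweet") or {}).get("full_text"),
        retweet.get("text"),
        (tweet.get("extended_tweet") or {}).get("full_text"),
        tweet.get("text"),
    ]
    # The first non-empty candidate wins
    return next((c for c in candidates if c), None)

plain = {"text": "short tweet"}
long_tweet = {"text": "truncated...",
              "extended_tweet": {"full_text": "the complete long tweet"}}
retweet = {"text": "RT @someone: truncated...",
           "retweeted_status": {
               "text": "original truncated...",
               "extended_tweet": {"full_text": "the complete original tweet"}}}

texts = [best_text(t) for t in (plain, long_tweet, retweet)]
print(texts)
```

For a retweet this keeps the original author's wording without the `RT @user:` prefix, which is what the selected `text` column shows in the next cells.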

In [9]:
# flatten tweets
tweets = flatten_tweets(tweets_data)

# select text
tweets = select_text(tweets)
columns = ['text', 'lang', 'user-location', 'place-country', 
           'place-country_code', 'location-coordinates', 
           'user-screen_name']

# Create a DataFrame from `tweets`
df_tweets = pd.DataFrame(tweets, columns=columns)
# replaces NaNs by Nones
#df_tweets.where(pd.notnull(df_tweets), None, inplace=True)
df_tweets.head(10)

Unnamed: 0,text,lang,user-location,place-country,place-country_code,location-coordinates,user-screen_name
0,"@siironique @Raheem0971 @twm1s @lzylv ""il faut...",fr,,,,,zoltan16k
1,"Nantes ce soir : moment historique, unique à n...",fr,FRANCE,,,,Cathylabb
2,Devant la dénégation du racisme endémique de l...,fr,,,,,samb_michele
3,Wallah ! Nique le ! Tue le ! Sale Gwer! ( sale...,fr,France - Paris,,,,LeftyMsr
4,Devant la dénégation du racisme endémique de l...,fr,"Paris, Ile-de-France",,,,Bien_monsieur
5,Ou j’ai dis mme que le racisme est un truc de ...,fr,"Rhône-Alpes, France",,,,baabeegurl1
6,Wallah ! Nique le ! Tue le ! Sale Gwer! ( sale...,fr,Paris,,,,Carl_Marois
7,Devant la dénégation du racisme endémique de l...,fr,"Troyes, France",,,,Bruno__Chatenet
8,Devant la dénégation du racisme endémique de l...,fr,,,,,nagou_san
9,Wallah ! Nique le ! Tue le ! Sale Gwer! ( sale...,fr,,,,,tricolore78


In [10]:
df_tweets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 7 columns):
text                    100 non-null object
lang                    100 non-null object
user-location           54 non-null object
place-country           0 non-null float64
place-country_code      0 non-null float64
location-coordinates    0 non-null float64
user-screen_name        100 non-null object
dtypes: float64(3), object(4)
memory usage: 5.5+ KB


In [11]:
# sample for quick check
df_tweets_sample = df_tweets.copy()[:50]

## <p><font color=lightsteelblue> &#128036; PART 3: Tweets analysis </font></p>

In this analysis, we extract several pieces of information from the tweets we collected:
* popularity of the search interest (defined by you)
* top 20 of the hashtags used in the collected tweets
* Original content VS Retweet
* Number of likes for the collected tweets
* Type of content (text, photo, video)

Depending on what you want to analyse, you can vary these queries. For instance, if you look for information about an international event, you might want to extract and sort country data.
<br><br><br>
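
For the country example, one possible approach is to count the tweets per `place.country_code`. The sketch below works on an in-memory list of dictionaries so that it runs without a MongoDB server; `count_countries` is a hypothetical helper and the sample documents are made up. With a live collection, the documents returned by `collection.find({})` could be passed in instead.

```python
from collections import Counter

def count_countries(docs):
    """Count tweets per place.country_code, skipping tweets without a place."""
    counter = Counter()
    for doc in docs:
        place = doc.get("place") or {}   # 'place' is often null
        code = place.get("country_code")
        if code:
            counter[code] += 1
    return counter

sample_docs = [
    {"text": "a", "place": {"country_code": "FR"}},
    {"text": "b", "place": None},
    {"text": "c", "place": {"country_code": "FR"}},
    {"text": "d", "place": {"country_code": "CA"}},
    {"text": "e"},
]
ranking = count_countries(sample_docs).most_common()
print(ranking)
```

Keep in mind that in the dataset collected here none of the 100 tweets carries place information, so such a ranking only becomes meaningful on larger collections.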

<p><font color=indianred>REQUIREMENT</font></p>

For the next step, you need to have the **mongod server running** (probably here: `MongoDB\Server\version\bin\mongod.exe`)

In [12]:
from collections import Counter
import pymongo
import re
from pymongo import MongoClient

# To import the collection to mongoDB:
# mongoimport -d analysis -c brexit  --file brexit.json

In [16]:
client = MongoClient('localhost', 27017)
db = client['analysis']
collection = db['loisecuriteglobale3']


YOUR_SEARCH_INTEREST = "Loi Securite Globale"
SEARCH_INTEREST = re.compile('stoploisecuriteglobale|loisecuriteglobale|pplsecuriteglobale')

outfile = open('tweets_analysis_output_file.txt', 'w')
outfile.write('''
    ###############
    TWEETS ANALYSIS
    ###############
    ''')

def find_popularity():
    # ----------------------
    # Mentions of __YOUR_SEARCH_INTEREST__ : x
    # Total tweets:  xxxx
    # -----------------------
    outfile.write("\nMENTIONS OF " + YOUR_SEARCH_INTEREST + ":\n--------------")
    occurence_of_search_interest = 0
    # db_max = 
    # db_min = 
    total_valid = 0

    cursor = collection.find({})
    for doc in cursor:
        try:
            text = doc['text']
        except KeyError:
            # skip documents without a 'text' field instead of reusing
            # the text of the previous document
            continue
        total_valid += 1
        occurence_of_search_interest += len(SEARCH_INTEREST.findall(text.lower()))
    outfile.write("\nMentions of " + YOUR_SEARCH_INTEREST + " : " + str(occurence_of_search_interest))
    outfile.write("\nTotal tweets: " + str(total_valid))
    outfile.write('\n')
    outfile.write('\n')


def top_20_hashtags():
    # -------------------
    # count hashtag
    # -------------------
    outfile.write("\nTOP 20 HASHTAGS:\n--------------\n")
    '''
    Traverse through all hashtags in every tweet
    - hashtags are present in the attribute - entities.hashtags
    - return value - array of hashtags, empty is no hashtag
        {"indices":[x,y], "text": <hashtag-text-here>}

    To generate word cloud: https://www.wordclouds.com/
    '''
    cursor = collection.find({})
    # db_max = 10
    # db_min = 0
    hashtag_counter = Counter()
    for doc in cursor:
        # db_min += 1
        # if db_min < db_max:
        try:
            hashtag_list = doc['entities']['hashtags']
            if len(hashtag_list) > 0:
                for ht in hashtag_list:
                    hashtag_counter[ht['text'].lower()] += 1
        except KeyError:
            pass
        # else:
            # break
    top_20 = hashtag_counter.most_common(20)

    for word, count in top_20:
        outfile.write(word + ": " + str(count) + "\n")
    outfile.write("\nTotal number of hashtags used: " + str(sum(hashtag_counter.values())) + "\n")
    outfile.write('\n')


def original_vs_retweeted():

    outfile.write("\nTYPES OF TWEETS - ORIGINAL VS RETWEETED:\n--------------")
    # Original tweets don't have the attribute
    #   retweeted_status
    # count_documents only needs the filter (the projection argument of find() does not apply)
    # Original
    # db.getCollection('loisecuriteglobale3').find({"retweeted_status":{$eq:null}},{}).length()
    res = collection.count_documents({"retweeted_status":{'$eq':None}})
    outfile.write("\nNumber of original tweets: " + str(res))
    # Retweeted
    # db.getCollection('loisecuriteglobale3').find({"retweeted_status":{$ne:null}},{}).length()
    res = collection.count_documents({"retweeted_status":{'$ne':None}})
    outfile.write("\nNumber of retweeted tweets: " + str(res))
    outfile.write('\n')
    outfile.write('\n')

def fav_counts():
    # find no of tweets greater than a number - ex: 50,000 => Outputs: 3
    # note: the query reads retweeted_status.favorite_count, i.e. the likes
    # of the original tweet behind each retweet
    # raw mongo query
    # db.getCollection('brexit').find({'retweeted_status.favorite_count':{$gt:30000}}).length()
    outfile.write("\nCOUNTS LIKES ON TWEETS:\n--------------")
    times = [1, 100, 1000, 5000, 10000, 50000]
    for value in times:
        res = collection.count_documents({'retweeted_status.favorite_count':{'$gt':value}})
        outfile.write("\nNumber of tweets liked more than " + str(value) + " times: " + str(res))
    outfile.write('\n')
    outfile.write('\n')

def tweet_type():

    outfile.write("\nTYPES OF TWEET CONTENT - TEXT / AUDIO / VIDEO\n--------------")
    # only text: 4822
    # db.getCollection('loisecuriteglobale3').find({'entities.media.type':{$eq:null}}).length()
    res = collection.count_documents({'entities.media.type':{'$eq':None}})
    outfile.write('\nTweets with only text: ' + str(res))
    # ----
    # contains Photo: 178
    # db.getCollection('loisecuriteglobale3').find({'extended_entities.media.type':'photo'}).length()
    res = collection.count_documents({'extended_entities.media.type':'photo'})
    outfile.write('\nTweets with images: ' + str(res))
    # ----
    # contains Video: 19
    # db.getCollection('loisecuriteglobale3').find({'extended_entities.media.type':'video'}).length()
    res = collection.count_documents({'extended_entities.media.type':'video'})
    outfile.write('\nTweets with videos: ' + str(res))
    # ----
    outfile.write('')


def main():
    find_popularity()
    top_20_hashtags()
    original_vs_retweeted()
    fav_counts()
    tweet_type()
    # close the report so everything is flushed to disk before it is read back
    outfile.close()

if __name__ == '__main__':
    main()
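
The mention count in `find_popularity` relies on `findall`, which returns non-overlapping matches. A hashtag such as `stoploisecuriteglobale` is therefore counted once, even though it also contains `loisecuriteglobale`. The sketch below reuses the same pattern on a few invented tweets so that this behaviour can be checked without a database; `count_mentions` is a hypothetical helper.

```python
import re

# Same pattern as in the analysis cell above
SEARCH_INTEREST = re.compile(
    'stoploisecuriteglobale|loisecuriteglobale|pplsecuriteglobale')

def count_mentions(text):
    """Number of non-overlapping matches of the search interest in `text`."""
    return len(SEARCH_INTEREST.findall(text.lower()))

# Invented examples, not taken from the collected dataset
examples = [
    "Manif contre la #LoiSecuriteGlobale ce soir",
    "#StopLoiSecuriteGlobale #PPLSecuriteGlobale",
    "Rien à voir",
]
counts = [count_mentions(t) for t in examples]
print(counts)
```

Because the text is lowercased before matching, the pattern itself can stay in lowercase and still catch hashtags written in mixed case.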

In [17]:
with open('tweets_analysis_output_file.txt', 'r') as f:
    file_contents = f.read()
print(file_contents)


    ###############
    TWEETS ANALYSIS
    ###############
    
MENTIONS OF Loi Securite Globale:
--------------
Mentions of Loi Securite Globale : 7
Total tweets: 110


TOP 20 HASHTAGS:
--------------
nantes: 4
loisecuriteglobale: 4
marchedeslibertés: 4
stoploisecuriteglobale: 3
france: 1
macron: 1
polqc: 1

Total number of hashtags used: 18


TYPES OF TWEETS - ORIGINAL VS RETWEETED:
--------------
Number of original tweets: 29
Number of retweeted tweets: 81


COUNTS LIKES ON TWEETS:
--------------
Number of tweets liked more than 1 times: 75
Number of tweets liked more than 100 times: 51
Number of tweets liked more than 1000 times: 10
Number of tweets liked more than 5000 times: 1
Number of tweets liked more than 10000 times: 0
Number of tweets liked more than 50000 times: 0


TYPES OF TWEET CONTENT - TEXT / AUDIO / VIDEO
--------------
Tweets with only text: 104
Tweets with images: 6
Tweets with videos: 0


## <p><font color=lightsteelblue> &#129413; PART 4: Data visualisation </font></p>