# Twitter data wrangling practice

__author__: Yongchan Kwon

__email__: yk3012@columbia.edu

In this notebook, we use the [Twitter](https://developer.twitter.com/en/products/twitter-api) dataset to practice converting hierarchical data into tabular data.



In [1]:
# # To load datasets from your google drive account, this command should be executed.
# from google.colab import drive
# drive.mount('/content/drive')

- Read the `singapore_twitter.json` file using `json.load()` and store the result in a variable called `twitter_data`.
 - Check the correct input argument of `json.load()`.
 - Check the type of the `json.load()` output and the number of tweets.
 - Print the first tweet.
 - Check the type of the first tweet.

In [4]:
import pandas as pd
import numpy as np
import json

In [9]:
with open('singapore_twitter.json', 'r', encoding='utf-8') as file:
    twitter_data = json.load(file)

In [10]:
print("Number of tweets:", len(twitter_data))

Number of tweets: 13740


In [14]:
# Check if the data is a list and not empty
if isinstance(twitter_data, list) and len(twitter_data) > 0:
    # Get the first tweet
    first_tweet = twitter_data[0]

    # Print the first tweet
    print("First tweet:", first_tweet)

    # Check the type of the first tweet
    print("Type of first tweet:", type(first_tweet))
else:
    print("twitter_data is not a list or it is empty.")

First tweet: {'referenced_tweets': [{'type': 'replied_to', 'id': '1491816184240480284'}], 'id': '1491816347117916162', 'in_reply_to_user_id': '204970988', 'reply_settings': 'everyone', 'entities': {'mentions': [{'start': 0, 'end': 12, 'username': 'QuantoQuant', 'id': '4332053537'}], 'annotations': [{'start': 241, 'end': 249, 'probability': 0.8172, 'type': 'Place', 'normalized_text': 'singapore'}]}, 'text': "@QuantoQuant I don't think we should have death penalty for dealers, but at the very least I think they should be deported if they are here illegally. Alternatively, I think they should go to prison for involuntary manslaughter. \n\nI believe singapore has only had a few dozen put to death.", 'source': 'Twitter Web App', 'created_at': '2022-02-10T16:48:28.000Z', 'public_metrics': {'retweet_count': 0, 'reply_count': 1, 'like_count': 2, 'quote_count': 0}, 'author_id': '204970988'}
Type of first tweet: <class 'dict'>


- Convert `twitter_data` to a `pandas.DataFrame`.
  - What are some interesting observations?
  - Print the data type of each column.

In [17]:
# Convert to DataFrame
df = pd.DataFrame(twitter_data)

# Show the first few rows
print(df.head())

# Print data types of each column
print(df.dtypes)

                                   referenced_tweets                   id  \
0  [{'type': 'replied_to', 'id': '149181618424048...  1491816347117916162   
1  [{'type': 'replied_to', 'id': '149178069710676...  1491813901842984962   
2  [{'type': 'retweeted', 'id': '1491257043856277...  1491604001720516608   
3  [{'type': 'retweeted', 'id': '1491822338874224...  1491822838290034693   
4  [{'type': 'retweeted', 'id': '1491821009225359...  1491822830316503043   

  in_reply_to_user_id reply_settings  \
0           204970988       everyone   
1          1382974668       everyone   
2                 NaN       everyone   
3                 NaN       everyone   
4                 NaN       everyone   

                                            entities  \
0  {'mentions': [{'start': 0, 'end': 12, 'usernam...   
1  {'mentions': [{'start': 0, 'end': 11, 'usernam...   
2  {'mentions': [{'start': 3, 'end': 19, 'usernam...   
3  {'urls': [{'start': 101, 'end': 124, 'url': 'h...   
4  {'mentions': 

The Twitter dataset has a **hierarchical structure**. For each tweet,
  - `referenced_tweets`
    - `type`
    - `id`
  - `id`
  - `entities`
    - `mentions`
        - `start`
        - `end`
        - ...
    - `annotations`
  - and more ...
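One way to see this hierarchy in tabular form is `pandas.json_normalize`, which expands nested dictionaries into dotted column names. The sketch below runs on a single hypothetical record shaped like the first tweet (values abridged); it is an illustration, not necessarily the approach intended for the exercises.

```python
import pandas as pd

# Hypothetical record shaped like one element of twitter_data (abridged).
sample = [
    {
        "id": "1491816347117916162",
        "author_id": "204970988",
        "created_at": "2022-02-10T16:48:28.000Z",
        "public_metrics": {"retweet_count": 0, "reply_count": 1,
                           "like_count": 2, "quote_count": 0},
        "referenced_tweets": [{"type": "replied_to",
                               "id": "1491816184240480284"}],
    }
]

# Nested dicts become columns such as "public_metrics.reply_count";
# list-valued fields such as "referenced_tweets" stay as Python lists.
flat = pd.json_normalize(sample)
print(flat.columns.tolist())
```

List-valued fields are not expanded by this call, so `referenced_tweets` and `entities.mentions` would still need separate handling.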

### Let's wrangle data

- Please wrangle the raw hierarchical dataset so we can look at the number of tweets/retweets for each entity over the different hours in the day.
  - Create a `pandas.DataFrame` called `df_tweet`.
  - Each row corresponds to a tweet or retweet.
  - Make columns `id` and `author_id` from the dataset. Also, make a column `reply_count` using `reply_count` in `public_metrics` information.
  - Make a column called `hour` using `created_at` information.
   - For instance, `2022-02-10T16:38:45.000Z` --> `16`
  - Make a column called `is_retweet` that indicates if it is a tweet or retweet. Hint: every retweet starts with `RT` in `text` information.
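The steps above could be sketched as follows. The two records are hypothetical stand-ins for elements of `twitter_data`, and the slicing of `created_at` assumes every timestamp follows the `YYYY-MM-DDTHH:MM:SS.000Z` layout shown in the example.

```python
import pandas as pd

# Hypothetical two-record sample; the notebook would loop over twitter_data.
sample = [
    {"id": "1", "author_id": "10", "text": "RT @someone: hello",
     "created_at": "2022-02-10T16:38:45.000Z",
     "public_metrics": {"reply_count": 0}},
    {"id": "2", "author_id": "11", "text": "An original post",
     "created_at": "2022-02-10T03:05:00.000Z",
     "public_metrics": {"reply_count": 2}},
]

rows = []
for tweet in sample:
    rows.append({
        "id": tweet["id"],
        "author_id": tweet["author_id"],
        "reply_count": tweet["public_metrics"]["reply_count"],
        # Characters 11-12 of the timestamp hold the two-digit hour.
        "hour": int(tweet["created_at"][11:13]),
        # Following the hint: retweets start with "RT".
        "is_retweet": tweet["text"].startswith("RT"),
    })

df_tweet = pd.DataFrame(rows)
print(df_tweet)
```

Parsing with `pd.to_datetime(...).dt.hour` would be an alternative to string slicing if the timestamp layout were less regular.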


#### Data analysis
  - How many samples are duplicated?

In [None]:
num_duplicates = df.duplicated().sum()

TypeError: unhashable type: 'list'
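The `TypeError` above arises because `DataFrame.duplicated()` needs to hash each cell, and columns such as `referenced_tweets` hold lists. One possible workaround, sketched here on hypothetical records, is to serialize each record to a canonical JSON string and look for duplicates among the strings. Running `duplicated()` on the flattened `df_tweet` instead would be another option.

```python
import json
import pandas as pd

# Hypothetical records; the second is an exact copy of the first.
records = [
    {"id": "1", "entities": {"mentions": [{"id": "9"}]}},
    {"id": "1", "entities": {"mentions": [{"id": "9"}]}},
    {"id": "2"},
]

# sort_keys=True makes the string independent of dictionary key order.
as_text = pd.Series([json.dumps(r, sort_keys=True) for r in records])

# duplicated() marks every repeat after the first occurrence.
num_duplicates = int(as_text.duplicated().sum())
print("Number of duplicated samples:", num_duplicates)
```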

- How many samples have at least one missing value?

- What is the proportion of retweets?

- What is the proportion of users with only one post?

- What is the proportion of posts with zero replies?
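The questions about missing values and proportions can each be answered with a boolean mask. The sketch below uses a small hypothetical `df_tweet` with the columns defined earlier; the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for df_tweet.
df_tweet = pd.DataFrame({
    "id": ["1", "2", "3", "4"],
    "author_id": ["10", "10", "11", "12"],
    "reply_count": [0, 2, 0, 0],
    "is_retweet": [True, False, False, True],
})

# Rows containing at least one missing value.
n_missing = int(df_tweet.isna().any(axis=1).sum())

# The mean of a boolean column is the share of True values.
prop_retweet = df_tweet["is_retweet"].mean()

# Count posts per user, then take the share of users with exactly one post.
posts_per_user = df_tweet["author_id"].value_counts()
prop_single_post_users = (posts_per_user == 1).mean()

prop_zero_reply = (df_tweet["reply_count"] == 0).mean()

print(n_missing, prop_retweet, prop_single_post_users, prop_zero_reply)
```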

- Plot a histogram of the number of posts by users.
 - For a pretty figure, consider only users with 5 or more posts.
 - Add an informative x-label and y-label.
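A minimal sketch of such a histogram, assuming `matplotlib` is available (the notebook does not import it) and using a hypothetical `author_id` column in place of the real one:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical author_id column; the real one would be df_tweet["author_id"].
author_ids = pd.Series(["a"] * 7 + ["b"] * 5 + ["c"] * 2 + ["d"])

posts_per_user = author_ids.value_counts()

# Keep only users with 5 or more posts.
active = posts_per_user[posts_per_user >= 5]

fig, ax = plt.subplots()
# One bin per integer post count, starting at 5.
ax.hist(active, bins=range(5, int(active.max()) + 2))
ax.set_xlabel("Number of posts per user")
ax.set_ylabel("Number of users")
```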

- When do people post tweets/retweets? Plot a line such that
 - x-axis indicates 24 hours
 - y-axis indicates the number of posts
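The line plot could be sketched as below, again assuming `matplotlib` and a hypothetical `hour` column. Because the `created_at` timestamps end in `Z`, the hours are in UTC rather than Singapore local time.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical hour column; the real one would be df_tweet["hour"].
hours = pd.Series([16, 16, 3, 23, 16, 3])

# Reindex so that hours without any posts still appear, with a count of 0.
posts_per_hour = hours.value_counts().reindex(range(24), fill_value=0)

fig, ax = plt.subplots()
ax.plot(posts_per_hour.index, posts_per_hour.values)
ax.set_xticks(range(0, 24, 2))
ax.set_xlabel("Hour of day (UTC)")
ax.set_ylabel("Number of posts")
```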

### Let's wrangle data 2

**Connectivity on Twitter:** How connected are the Twitter users in our dataset? A connection here is defined as one user mentioning another user.

 - Create a `pandas.DataFrame` in which each row and each column corresponds to a person. Each row represents an original poster, and each column represents a user mentioned at least once by any of the original posters.
  - If a user is mentioned by an original poster, the corresponding value is 1.
  - Otherwise, it is 0.
  - For each tweet, the `author_id` of each mentioned user is provided as `id` in `[entities]-[mentions]`.
    - For instance,
    ```
    # twitter_data[51]["entities"]["mentions"] # 4 mentions
    > [{'start': 0,
  'end': 15,
  'username': 'RajeshKumar_TT',
  'id': '1268390945873149952'},
 {'start': 16, 'end': 26, 'username': 'elemelopq', 'id': '26522834'},
 {'start': 27, 'end': 40, 'username': 'Flying_Mallu', 'id': '95138568'},
 {'start': 41, 'end': 55, 'username': 'ShashiTharoor', 'id': '24705126'}]
    ```
    - For some tweets, `entities` does not exist.
    ```
    # twitter_data[25] # no entities
    > {'text': 'Halsey tak nak buat Asia Tour ka? At least mai la Singapore/ Jakarta.',
 'id': '1491822511125766171',
 'created_at': '2022-02-10T17:12:58.000Z',
 'public_metrics': {'retweet_count': 0,
  'reply_count': 0,
  'like_count': 0,
  'quote_count': 0},
 'source': 'Twitter for iPhone',
 'author_id': '915159330',
 'reply_settings': 'everyone'}
    ```
    - For some entities, `mentions` does not exist.
    ```
    # twitter_data[7]["entities"]
    > {'hashtags': [{'start': 0, 'end': 6, 'tag': 'MBIOI'},
  {'start': 39, 'end': 47, 'tag': 'Ironore'},
  {'start': 83, 'end': 91, 'tag': 'Qingdao'}],
 'annotations': [{'start': 213,
   'end': 221,
   'probability': 0.9565,
   'type': 'Place',
   'normalized_text': 'Singapore'}]}
    ```
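The three cases above (mentions present, no `entities`, no `mentions`) can be handled uniformly with `dict.get` and default values. The sketch below builds the author-to-mentioned-ids dictionary from hypothetical records; in the notebook the loop would run over `twitter_data`.

```python
# Hypothetical records covering the three cases described above.
sample = [
    {"author_id": "a1",
     "entities": {"mentions": [{"id": "b1"}, {"id": "b2"}]}},
    {"author_id": "a2"},                                    # no entities
    {"author_id": "a3",
     "entities": {"hashtags": [{"tag": "MBIOI"}]}},         # no mentions
    {"author_id": "a1",
     "entities": {"mentions": [{"id": "b2"}]}},             # repeat mention
]

mention_dict = {}
for tweet in sample:
    # Fall back to an empty dict / list when a level is missing.
    mentions = tweet.get("entities", {}).get("mentions", [])
    if mentions:
        # Sets absorb repeated mentions of the same user.
        mention_dict.setdefault(tweet["author_id"], set()).update(
            m["id"] for m in mentions
        )

print(mention_dict)
```

In this sketch, authors who never mention anyone are left out of the dictionary. Whether they should instead appear as all-zero rows is a design choice the exercise leaves open.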

In [3]:
'''
Hint: Define a dictionary whose keys are author `id`s and whose values are sets of
neighbors' ids (the users each author mentioned). Then create a data frame of zeros
and set the entry for every (key, value) pair to one.
'''
tmp_dict = {'a1': {'b1', 'b2'},
            'a2': {'b2', 'b3'},
            'a3': {'b3'}}

authors = tmp_dict.keys()
# Union of all mentioned ids across authors
references = {n_id for value in tmp_dict.values() for n_id in value}
authors, references

(dict_keys(['a1', 'a2', 'a3']), {'b1', 'b2', 'b3'})

In [4]:
# Start from an all-zero frame; sort the columns because set order is arbitrary
df = pd.DataFrame(0, index=list(authors), columns=sorted(references))
for a, mentioned in tmp_dict.items():
    for b in mentioned:
        df.loc[a, b] = 1
df

    b1  b2  b3
a1   1   1   0
a2   0   1   1
a3   0   0   1


In [5]:
authors, references

(dict_keys(['a1', 'a2', 'a3']), {'b1', 'b2', 'b3'})

### Regular expression

- Print every post that mentions `machine learning`.
 - The single space can be replaced with an underscore (`machine_learning`) or a hyphen (`machine-learning`).
 - Both `machine` and `learning` can start with a capital letter. For instance, `MachineLearning`, `Machine_Learning`, `Machine-Learning`, or `Machine learning`, ...
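One pattern that covers the variants listed above is sketched below. The sample texts are hypothetical; in the notebook the pattern would be applied to `tweet['text']` for each tweet in `twitter_data`.

```python
import re

# [Mm] and [Ll] accept either case for the first letter of each word.
# [ _-]? accepts a space, underscore, hyphen, or no separator at all
# (the last case covers "MachineLearning").
pattern = re.compile(r"[Mm]achine[ _-]?[Ll]earning")

# Hypothetical post texts.
texts = [
    "I love Machine Learning",
    "machine_learning rocks",
    "MachineLearning",
    "Machine-learning in Singapore",
    "machines are learning",   # should not match
    "deep learning",           # should not match
]

matches = [t for t in texts if pattern.search(t)]
for t in matches:
    print(t)
```

The pattern deliberately does not use `re.IGNORECASE`, so only the first letter of each word may vary in case, as the task describes.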


In [6]:
import re

- Some posts refer to machine learning as `"ML"`.
```
twitter_data[9131]['text']
>"RT @DisruptiveAsean: New Singapore @Rackspace Technology Report Finds AI/ML Technologies Increasingly Mission-Critical, but Full Benefits H…"
```
- Now create a list of posts that mention machine learning, including posts that use `"ML"`.
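A possible extension adds `ML` as an alternative, wrapped in word boundaries (`\b`) so that strings such as `HTML` or `XML` are not picked up. The texts below are hypothetical, apart from the abridged `AI/ML` example quoted above.

```python
import re

# Either the spelled-out phrase (with the variants from the previous task)
# or the standalone, upper-case token "ML".
ml_pattern = re.compile(r"[Mm]achine[ _-]?[Ll]earning|\bML\b")

# Hypothetical post texts.
texts = [
    "RT @DisruptiveAsean: AI/ML Technologies Increasingly Mission-Critical",
    "HTML tips for beginners",     # "ML" inside a longer word: no match
    "machine learning meetup",
    "XML parsing in Singapore",    # no match
]

ml_posts = [t for t in texts if ml_pattern.search(t)]
print(ml_posts)
```

The `/` in `AI/ML` is not a word character, so `\b` still matches there.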
