<img src="../../../images/banners/pandas-cropped.jpeg" width="600"/>

<a class="anchor" id="intro_to_data_structures"></a>
# <img src="../../../images/logos/pandas.png" width="23"/> DataFrame Mini Project: Twitter Data

## <img src="../../../images/logos/toc.png" width="20"/> Table of Contents
* [Required Libraries](#required-libraries)
* [Connect to the API](#connect-to-the-api)
* [Call the API](#call-the-api)
* [Read Dumped Data](#read-dumped-data)
* [Extract First name and Last name](#extract-firstname-and-lastname)
* [Predict Gender](#predict-gender)
* [Your Turn!](#your-turn)

---

# Required Libraries

Twitter provides an API that you can use to retrieve tweets and user profiles. This notebook uses the `tweepy` library to call it.

In [1]:
!pip install tweepy



In [2]:
import tweepy

In [None]:
from pathlib import Path
import pandas as pd
import json
import os
from tqdm import tqdm

# Connect to the API

To use the Twitter API, you first need a developer account. Read [here](https://developer.twitter.com/en/docs/twitter-api/getting-started/getting-access-to-the-twitter-api) to learn how to get one.

Once you have an account, you can generate the keys and access tokens needed to authenticate and call the API.

In [4]:
CONSUMER_KEY = os.environ['CONSUMER_KEY']
CONSUMER_SECRET = os.environ['CONSUMER_SECRET']
ACCESS_TOKEN = os.environ['ACCESS_TOKEN']
ACCESS_TOKEN_SECRET = os.environ['ACCESS_TOKEN_SECRET']
BEARER_TOKEN = os.environ['BEARER_TOKEN']
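
Indexing `os.environ` directly raises a bare `KeyError` at the first variable that is not set. The following is a minimal sketch, assuming the same five variable names as the cell above, that reports every missing variable in one message. The helper name `load_credentials` is hypothetical and is not part of `tweepy`.

```python
import os

# Names of the environment variables read in the cell above.
REQUIRED_VARS = (
    "CONSUMER_KEY",
    "CONSUMER_SECRET",
    "ACCESS_TOKEN",
    "ACCESS_TOKEN_SECRET",
    "BEARER_TOKEN",
)


def load_credentials(env=None):
    """Return a dict of credentials, or raise an error naming every missing variable.

    Hypothetical helper for illustration only; `env` defaults to `os.environ`.
    """
    env = os.environ if env is None else env
    # Treat unset and empty variables the same way.
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Calling `load_credentials()` with no argument reads the real environment; passing a plain dict makes the helper easy to try without exporting any secrets.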

In [5]:
# Option 1: app-only authentication (OAuth 2.0), using only the consumer key and secret.
auth = tweepy.OAuth2AppHandler(
    CONSUMER_KEY, CONSUMER_SECRET
)
api = tweepy.API(auth)

In [6]:
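# Option 2: user-context authentication (OAuth 1.0a). This replaces the `api` object from the previous cell.
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
# wait_on_rate_limit=True makes tweepy wait for the rate limit to reset instead of raising an error.
api = tweepy.API(auth, wait_on_rate_limit=True)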

# Call the API

In [15]:
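# Make sure the output folder exists before writing into it.
Path('./data/twitter_data').mkdir(parents=True, exist_ok=True)

tweets = []
# The query below is the Persian hashtag for Mahsa Amini.
for page in tqdm(tweepy.Cursor(
    api.search_tweets,
    tweet_mode='extended',  # return the untruncated text in `full_text`
    q="#مهسا_امینی",
    count=100,  # tweets requested per page
    # lang="en",
).pages(1000)):  # fetch at most 1000 pages
    for tweet in page:
        json_data = tweet._json  # raw API payload as a dict
        # Save one JSON file per tweet, named by the tweet ID.
        with open(f'./data/twitter_data/{json_data["id"]}.json', 'w') as f:
            json.dump(json_data, f)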
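
Naming each file after the tweet ID means that a tweet seen twice overwrites its earlier copy instead of producing a duplicate. The sketch below reproduces that dump pattern offline, without calling the API. The function `dump_tweets` and the sample records are made up for illustration and stand in for `tweet._json`.

```python
import json
import tempfile
from pathlib import Path


def dump_tweets(tweets, out_dir):
    """Write each tweet dict to `<out_dir>/<id>.json` and return the paths written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    written = []
    for tweet in tweets:
        path = out_dir / f"{tweet['id']}.json"
        with open(path, "w", encoding="utf-8") as f:
            json.dump(tweet, f)
        written.append(path)
    return written


# Made-up records, not real API output.
sample_tweets = [
    {"id": 1, "full_text": "first"},
    {"id": 2, "full_text": "second"},
    {"id": 1, "full_text": "first"},  # same ID again: overwrites 1.json
]

with tempfile.TemporaryDirectory() as tmp:
    paths = dump_tweets(sample_tweets, tmp)
    n_files = len(list(Path(tmp).iterdir()))

print(n_files)  # 2 files for 3 writes, because one ID repeats
```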

# Read Dumped Data

In [18]:
DATA_DIR = Path('./data/twitter_data')

In [19]:
def read_json(file_path):
    """Load one dumped tweet from a JSON file and return it as a dict."""
    with open(file_path) as f:
        return json.load(f)

In [22]:
rows = []
for file_path in tqdm(DATA_DIR.iterdir()):
    if file_path.is_dir():
        continue
    d = read_json(file_path)
    user = d['user']

    rows.append(dict(
        name=user['name'],
        followers=user['followers_count'],
        following=user['friends_count'],
        # Add 1 to the denominator so accounts that follow nobody do not cause a division by zero.
        follower_following_ratio=user['followers_count'] / (user['friends_count'] + 1),
        # Extended-mode tweets keep their text in `full_text`; fall back to `text` otherwise.
        text=d.get('full_text') or d.get('text'),
        hashtags=[item['text'] for item in d['entities']['hashtags']],
        likes=d['favorite_count'],
        retweets=d['retweet_count'],
    ))

70853it [00:14, 4904.02it/s]
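
To see what a single row looks like, the same field extraction can be applied to one hand-written record. This is a sketch only: `tweet_to_row` is a hypothetical wrapper around the logic in the loop above, and `sample` is a made-up dict that contains just the keys the loop reads, not a full API payload.

```python
def tweet_to_row(d):
    """Flatten one tweet dict into the columns used in this notebook."""
    user = d["user"]
    return {
        "name": user["name"],
        "followers": user["followers_count"],
        "following": user["friends_count"],
        # +1 keeps the ratio defined when the account follows nobody.
        "follower_following_ratio": user["followers_count"] / (user["friends_count"] + 1),
        "text": d.get("full_text") or d.get("text"),
        "hashtags": [item["text"] for item in d["entities"]["hashtags"]],
        "likes": d["favorite_count"],
        "retweets": d["retweet_count"],
    }


# Made-up record with only the fields read above.
sample = {
    "full_text": "hello #pandas",
    "user": {"name": "Jane Doe", "followers_count": 50, "friends_count": 0},
    "entities": {"hashtags": [{"text": "pandas"}]},
    "favorite_count": 3,
    "retweets_placeholder": None,
    "retweet_count": 1,
}

row = tweet_to_row(sample)
print(row["follower_following_ratio"])  # 50.0, since 50 / (0 + 1)
print(row["hashtags"])                  # ['pandas']
```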


In [23]:
df = pd.DataFrame(rows)
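
`pd.DataFrame` accepts a list of dicts and turns each dict into a row, with the dict keys as column names. A small sketch with two made-up rows (the names and values are illustrative, not taken from the dumped data) shows the resulting shape and that list-valued cells such as `hashtags` are stored as Python lists:

```python
import pandas as pd

# Made-up rows in the same style as the `rows` list built above.
demo_rows = [
    {"name": "Jane Doe", "followers": 50, "hashtags": ["pandas"]},
    {"name": "John Roe", "followers": 7, "hashtags": []},
]

df_demo = pd.DataFrame(demo_rows)

print(df_demo.shape)          # (2, 3)
print(list(df_demo.columns))  # ['name', 'followers', 'hashtags']
```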

In [29]:
pd.set_option('display.min_rows', 20)
pd.set_option('display.max_colwidth', 200)
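
`pd.set_option` changes the display settings for the rest of the session. If the wider columns are only needed for one cell, `pd.option_context` applies the same options temporarily and restores the previous values on exit. A minimal sketch, using the same two options as above:

```python
import pandas as pd

# Options set here apply only inside the `with` block.
with pd.option_context("display.min_rows", 20, "display.max_colwidth", 200):
    inside = pd.get_option("display.max_colwidth")

# After the block, the earlier value is back in effect.
outside = pd.get_option("display.max_colwidth")

print(inside)  # 200
```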