# 1. Creating New Twitter Dataset

**Author:** Tori Stiegman   
**Project:** Gender-Inclusive Language in Tweets about Menstruation   
**Date turned in:** Dec 19, 2022
**Updated:** Feb 28, 2023

**About this notebook:** This notebook extracts tweets and their metadata from JSON files of Twitter API output, then exports the result to a new CSV file.

**Table of Contents:**
1. [Load in Raw Data](#sec1)
2. [Create dataframe](#sec2)
3. [Drop Duplicates](#sec3)
4. [Create CSV](#sec4)

In [3]:
import json
import tweepy
import numpy as np
import advertools as adv

import pandas as pd
pd.set_option('display.max_colwidth', None)

# get rid of warnings pls
import warnings
warnings.filterwarnings('ignore')

<a name="sec1"></a>
## Load in the Raw Data

The raw data was collected by Eni and is stored in the folder `twitter-dataset-1`. It contains 325 files, with roughly 1,000 tweets per file.

**Keywords:** 
- menstrual cycle
- tracking your period
- period tracking
- track your menstrual cycle
- period tracker
- menstruation 

**Timeline of Tweets:**
- Start: 11/10/20
- End: 11/10/22


In [2]:
tweetListFull = []

for i in range(327):
    file_name = "results_" + str(i) + ".json"
    with open(file_name, 'r') as inFile:
        tweetListFull.append(json.load(inFile))
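The loop above hard-codes the number of files. As a minimal sketch, assuming the files sit in a single directory and follow the `results_<n>.json` naming used above, the file list could instead be discovered from disk. The helper name `load_result_files` and its `data_dir` parameter are hypothetical and not part of the original notebook.

```python
import json
import re
from pathlib import Path


def load_result_files(data_dir="."):
    """Load every results_<n>.json file found in data_dir, ordered by <n>."""
    pattern = re.compile(r"results_(\d+)\.json")

    # Collect (file number, path) pairs for files matching the naming scheme.
    numbered = []
    for path in Path(data_dir).iterdir():
        match = pattern.fullmatch(path.name)
        if match:
            numbered.append((int(match.group(1)), path))

    # Sort numerically so results_2.json comes before results_10.json.
    documents = []
    for _, path in sorted(numbered, key=lambda pair: pair[0]):
        with open(path, "r") as in_file:
            documents.append(json.load(in_file))
    return documents
```

Called as `load_result_files()`, this reads from the current working directory, as the loop above does. Checking `len()` of the returned list is a quick way to confirm how many files were actually loaded.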

<a name="sec2"></a>
## Create Dataframe

Create a dataframe, `dfTweetsFull`, by looping through each file and pulling out specific elements of the tweets:
- tweet id
- author id
- text of the tweet
- date of the tweet
- retweet_count
- reply_count
- like_count

In [51]:
tweetList = []

for document in tweetListFull:           # one entry per loaded file
    for entry in document:               # each entry carries a 'tweets' list
        for tweet in entry['tweets']:

            metrics = tweet['public_metrics']

            # extract the information and add to single tweet dictionary
            singleTweetDict = {
                'tweet_id': tweet['id'],
                'author_id': tweet['author_id'],
                'text': tweet['text'],
                # keep only the date part of the timestamp
                'date': tweet['created_at'].split(' ')[0],
                'retweet_count': metrics['retweet_count'],
                'reply_count': metrics['reply_count'],
                'like_count': metrics['like_count'],
            }

            # add the dictionary to tweet list
            tweetList.append(singleTweetDict)
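To make the nesting explicit, here is a small sketch of the same extraction written as a function and run on one hand-made document. The function name `flatten_tweets` and every sample value are illustrative assumptions. The structure (a list of files, each a list of entries with a `tweets` list) is inferred from the indexing in the loop above, and the space-separated `created_at` format is assumed from the `split(' ')` call.

```python
def flatten_tweets(documents):
    """Turn nested API output into a flat list of one dict per tweet."""
    rows = []
    for document in documents:
        for entry in document:
            for tweet in entry["tweets"]:
                metrics = tweet["public_metrics"]
                rows.append({
                    "tweet_id": tweet["id"],
                    "author_id": tweet["author_id"],
                    "text": tweet["text"],
                    # keep only the date part of the timestamp
                    "date": tweet["created_at"].split(" ")[0],
                    "retweet_count": metrics["retweet_count"],
                    "reply_count": metrics["reply_count"],
                    "like_count": metrics["like_count"],
                })
    return rows


# Made-up sample: one file, one entry, one tweet.
sample_documents = [[{
    "tweets": [{
        "id": "1",
        "author_id": "42",
        "text": "example tweet",
        "created_at": "2022-11-10 08:30:00+00:00",
        "public_metrics": {
            "retweet_count": 0,
            "reply_count": 1,
            "like_count": 2,
        },
    }]
}]]

rows = flatten_tweets(sample_documents)
print(rows[0]["date"])
```

Each tweet becomes one dictionary with seven keys, which is why the resulting dataframe has seven columns.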

In [48]:
dfTweetsFull = pd.DataFrame(tweetList)

In [52]:
dfTweetsFull.shape

(309987, 7)

<a name="sec3"></a>
## Drop Tweets with duplicate text

Drop tweets whose text exactly duplicates the text of an earlier tweet. Only the first instance of each text is kept.

In [57]:
dfTweetsNoDupe = dfTweetsFull.drop_duplicates(subset="text", keep="first")

In [58]:
dfTweetsNoDupe.shape

(301151, 7)
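As a small illustration of what `drop_duplicates(subset="text", keep="first")` does, the sketch below applies it to a toy dataframe. The dataframe `toy` and its values are made up for this example.

```python
import pandas as pd

# Toy data: the first two rows share identical text.
toy = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    "text": ["same text", "same text", "different text"],
})

# Keep the first row for each distinct text; later repeats are dropped.
deduped = toy.drop_duplicates(subset="text", keep="first")

# Number of rows removed by deduplication.
n_dropped = len(toy) - len(deduped)
print(deduped)
print(n_dropped)
```

The comparison is exact, so texts that differ only in case or whitespace count as distinct. The surviving rows also keep their original index labels. Applied to the shapes reported above, the same subtraction gives 309987 - 301151 = 8836 dropped rows.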

<a name="sec4"></a>
## Create CSV

Create a CSV, `period_tweets.csv`, containing all of the deduplicated tweets.

In [63]:
dfTweetsNoDupe.to_csv('period_tweets.csv')
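Because `to_csv` is called without `index=False`, the dataframe index is written as an unnamed first column. The sketch below shows one way to read such a file back, using an in-memory buffer in place of `period_tweets.csv`. The toy values are made up, and reading `tweet_id` as a string is an assumption about how downstream code may want to treat the IDs, not something the original notebook specifies.

```python
import io

import pandas as pd

# Toy frame with a non-default index, mimicking rows left after deduplication.
toy = pd.DataFrame(
    {"tweet_id": ["1590000000000000001"], "text": ["example tweet"]},
    index=[5],
)

# Write to an in-memory buffer; the index becomes the first, unnamed column.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# Restore the index from column 0 and keep the long IDs as strings.
restored = pd.read_csv(buffer, index_col=0, dtype={"tweet_id": str})
print(restored)
```

Without `index_col=0`, the saved index would come back as an extra column named `Unnamed: 0`.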