# 3. Cleaning My Twitter Data


**Author:** Tori Stiegman   
**Project:** Gender-Inclusive Language in Tweets about Menstruation   
**Date turned in:** Dec 19, 2022

**About this notebook:** In this notebook I will go through the process of cleaning and preparing the Twitter data I extracted in an earlier notebook. 


**Table of Contents**
1. [Load Data](#data)
2. [Remove Emojis](#emoji)
3. [Make All Letters Lowercase](#lower)
4. [Remove Hashtags and Mentions](#hashtag)
5. [Remove links](#link)
6. [Remove punctuation and non-alphanumeric characters](#punct)
7. [Tokenization](#token)
8. [Stemming](#stem)
9. [Export to CSV](#csv)

In [1]:
import json
import tweepy
import numpy as np
import advertools as adv
import re
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.stem import PorterStemmer

import pandas as pd
# pd.set_option('display.max_colwidth', None)

# get rid of warnings pls
import warnings
warnings.filterwarnings('ignore')

<a name="data"></a>
## Load Data

Here I will load in my data, which includes:
- `noTrainTest_fullTweets.csv`: CSV with the full Twitter data, excluding the training and testing sets
- `2023_training_labeled.csv`: CSV with labeled training set
- `2023_test_labeled.csv`: CSV with labeled testing set

In [52]:
# so we can see the whole tweet text
pd.set_option('display.max_colwidth', None)

Load in the full dataset

In [53]:
dfFull = pd.read_csv('noTrainTest_fullTweets.csv')
dfFull.head()

# second name for the same DataFrame (an alias, not an independent copy)
dfFullClean = dfFull

Load in training dataset

In [54]:
dfTraining = pd.read_csv("2023_training_labeled.csv")

trainClean = dfTraining

Load in testing dataset

In [55]:
dfTest = pd.read_csv("2023_test_labeled.csv")

testClean = dfTest

Make a new column, `text_clean`, that I will clean throughout the rest of the notebook. 

In [56]:
makeString = lambda text: str(text)

dfFullClean['text_clean'] = dfFull['text'].apply(makeString)
trainClean['text_clean'] = trainClean['text'].apply(makeString)
testClean['text_clean'] = testClean['text'].apply(makeString)
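One thing worth checking at this step: `str` turns a missing value into the literal text `nan`, which would then pass through the rest of the cleaning as if it were a word. The sketch below shows that behavior with a plain float NaN (how pandas usually represents a missing entry in a text column) and one possible guard. `to_text` is a hypothetical helper, not something this notebook uses.

```python
import math

def to_text(value):
    """Return '' for missing values, otherwise the string form of value."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return ""
    return str(value)

missing = float("nan")
print(repr(str(missing)))      # what a bare str() call produces
print(repr(to_text(missing)))  # what the guarded helper produces
print(to_text("Period talk"))
```

Whether empty strings or dropped rows are the better choice depends on how the downstream model treats empty documents.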

In [57]:
# dfFullClean.head()

<a name="emoji"></a>
## Remove Emojis

Remove emojis from the tweets by replacing them with an empty string, `""`.

In [58]:
# helper function
emoji_pattern = re.compile("["
                        u"\U0001F600-\U0001F64F"  # emoticons
                        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                        u"\U0001F680-\U0001F6FF"  # transport & map symbols
                        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                        u"\U00002702-\U000027B0"
                        u"\U000024C2-\U0001F251"  # very broad range: also covers many non-Latin scripts
                        u"\U0001f926-\U0001f937"
                        u"\U0001F1F2"
                        u"\U0001F1F4"
                        u"\U0001F620"
                        u"\u200d"  # zero-width joiner
                        u"\u2640-\u2642"
                        u"\u2600-\u2B55"
                        u"\u23cf"
                        u"\u23e9"
                        u"\u231a"
                        u"\ufe0f"  # variation selector-16 (emoji presentation)
                        u"\u3030"
                        u"\U00002500-\U00002BEF"  # box drawing, dingbats, misc symbols & arrows
                        u"\U00010000-\U0010ffff"  # all supplementary planes
                        "]+", flags=re.UNICODE)

emoji = lambda text: re.sub(emoji_pattern,"", text)

In [59]:
dfFullClean['text_clean'] = dfFullClean['text_clean'].apply(emoji)
trainClean['text_clean'] = trainClean['text_clean'].apply(emoji)
testClean['text_clean'] = testClean['text_clean'].apply(emoji)
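To see what this step does to a single tweet, here is a minimal sketch using a deliberately reduced pattern (three emoji blocks plus the joiner and variation selector), not the full pattern above. The sample text and the names `small_emoji_pattern` and `strip_emoji` are made up for illustration.

```python
import re

# Reduced pattern for illustration only: three common emoji blocks plus
# the zero-width joiner and variation selector used in emoji sequences.
small_emoji_pattern = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\u200d\ufe0f"
    "]+"
)

def strip_emoji(text):
    """Delete every run of characters matched by the reduced pattern."""
    return small_emoji_pattern.sub("", text)

sample = "cramps again \U0001F62D\U0001F62D send chocolate \U0001F36B"
cleaned = strip_emoji(sample)
print(repr(cleaned))
```

Because the emojis are deleted rather than replaced with a space, the whitespace that surrounded them stays behind, so doubled or trailing spaces are expected until tokenization.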

<a name="lower"></a>
## Make all Letters Lowercase

In [60]:
# helper function
lower = lambda text: text.lower()

dfFullClean['text_clean'] = dfFullClean['text_clean'].apply(lower)
trainClean['text_clean'] = trainClean['text_clean'].apply(lower)
testClean['text_clean'] = testClean['text_clean'].apply(lower)

<a name="hashtag"></a>
## Remove Hashtags and Mentions

Remove hashtags and Twitter user mentions from the tweets by replacing them with an empty string, `""`.

In [2]:
# helper functions
mentions = lambda text: re.sub("@[A-Za-z0-9_]+","", text)
hashtags = lambda text: re.sub("#[A-Za-z0-9_]+","", text)

# apply the helper functions
dfFullClean['text_clean'] = dfFullClean['text_clean'].apply(mentions)
dfFullClean['text_clean'] = dfFullClean['text_clean'].apply(hashtags)

trainClean['text_clean'] = trainClean['text_clean'].apply(mentions)
trainClean['text_clean'] = trainClean['text_clean'].apply(hashtags)

testClean['text_clean'] = testClean['text_clean'].apply(mentions)
testClean['text_clean'] = testClean['text_clean'].apply(hashtags)

dfFullClean.head(10)
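As a quick check of what the two patterns match, the sketch below applies them to one made-up tweet. The function names and the sample text are illustrative only.

```python
import re

def strip_mentions(text):
    """Remove @handles (letters, digits, underscores)."""
    return re.sub(r"@[A-Za-z0-9_]+", "", text)

def strip_hashtags(text):
    """Remove #tags, including the word after the # sign."""
    return re.sub(r"#[A-Za-z0-9_]+", "", text)

sample = "@some_user my period is late again #periodproblems #pms"
cleaned = strip_hashtags(strip_mentions(sample))
print(repr(cleaned))
```

Note that the whole hashtag is removed, not just the `#`, so a word that appears only inside a hashtag does not reach the model.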

<a name="link"></a>
## Remove links

Remove hyperlinks from the tweets by replacing them with an empty string, `""`.

In [3]:
# helper functions
http = lambda text: re.sub(r"http\S+", "", text)
www = lambda text: re.sub(r"www\.\S+", "", text)  # escape the dot so it matches a literal "."

# apply helper functions
dfFullClean['text_clean'] = dfFullClean['text_clean'].apply(http)
dfFullClean['text_clean'] = dfFullClean['text_clean'].apply(www)

trainClean['text_clean'] = trainClean['text_clean'].apply(http)
trainClean['text_clean'] = trainClean['text_clean'].apply(www)

testClean['text_clean'] = testClean['text_clean'].apply(http)
testClean['text_clean'] = testClean['text_clean'].apply(www)

dfFullClean.head(10)
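A small sketch of the link removal on one made-up string follows. The URLs are placeholders, `strip_links` is a hypothetical name, and the `www` pattern is written with an escaped dot so that it only matches a literal period.

```python
import re

def strip_links(text):
    """Remove anything starting with http or www. up to the next whitespace."""
    text = re.sub(r"http\S+", "", text)
    return re.sub(r"www\.\S+", "", text)

sample = "great thread https://t.co/abc123 and www.example.com/info too"
cleaned = strip_links(sample)
print(repr(cleaned))
```

`\S+` consumes everything up to the next whitespace, so punctuation glued to the end of a URL is removed along with it.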

<a name="punct"></a>
## Remove punctuation and non-alphanumeric characters

Replace punctuation and other non-alphanumeric characters in the tweets with a single space, `" "`.

In [67]:
# helper functions
one = lambda text: re.sub('[()!?&]', ' ', text) ## May not want to remove exclamation points...
two = lambda text: re.sub(r'\[.*?\-]', ' ', text)  # raw string; matches a "[" up to the next "-]"
nonAlpha = lambda text: re.sub("[^a-z0-9]"," ", text)

# apply helper functions
dfFullClean['text_clean'] = dfFullClean['text_clean'].apply(one)
dfFullClean['text_clean'] = dfFullClean['text_clean'].apply(two)
dfFullClean['text_clean'] = dfFullClean['text_clean'].apply(nonAlpha)

trainClean['text_clean'] = trainClean['text_clean'].apply(one)
trainClean['text_clean'] = trainClean['text_clean'].apply(two)
trainClean['text_clean'] = trainClean['text_clean'].apply(nonAlpha)

testClean['text_clean'] = testClean['text_clean'].apply(one)
testClean['text_clean'] = testClean['text_clean'].apply(two)
testClean['text_clean'] = testClean['text_clean'].apply(nonAlpha)

# dfFullClean.head()
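The sketch below illustrates the last of these substitutions on a made-up, already lowercased string. `strip_non_alnum` is a hypothetical name for the same regular expression.

```python
import re

def strip_non_alnum(text):
    """Turn every character outside a-z and 0-9 into one space."""
    return re.sub(r"[^a-z0-9]", " ", text)

sample = "ugh... day 2 & it's the worst!"
spaced = strip_non_alnum(sample)
print(repr(spaced))
print(spaced.split())
```

Each removed character becomes its own space, so the string keeps its length and picks up runs of spaces. Apostrophes are replaced too, which means a contraction such as `it's` (or `she's`) reaches the later steps as two separate pieces.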

<a name="token"></a>
## Tokenization

Split each tweet into a list of word-level "tokens."

In [68]:
# !pip install nltk

In [69]:
from nltk.tokenize import TweetTokenizer
from nltk.tokenize import word_tokenize

In [71]:
# helper function
tt = TweetTokenizer()
token = lambda text: tt.tokenize(text)

# apply function
dfFullClean['text_clean'] = dfFullClean['text_clean'].apply(makeString).apply(token)
trainClean['text_clean'] = trainClean['text_clean'].apply(makeString).apply(token)
testClean['text_clean'] = testClean['text_clean'].apply(makeString).apply(token)

In [4]:
# trainClean.head(10)

<a name="stem"></a>
## Stemming

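Reduce each word to its stem, the base form left after suffixes are stripped (the Porter stemmer used here maps "running" to "run", for example). Stemming is a rule-based shortcut and differs from lemmatization, which maps words to dictionary forms known as "lemmas."
Collapsing word variants this way shrinks the vocabulary the model has to learn. 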

### Specify stop words

Specify the stop words (very common words such as "the" or "is" that carry little meaning on their own) to drop while stemming. 

I will keep these stop words since they are related to my query: 
"he", "him", "his", "himself", "she", "she's", "her", "hers", "herself", "they", "them", "their", "theirs", "themselves"

In [74]:
stop_words=stopwords.words('english')
stop_words_og = stop_words.copy()  # untouched copy of the original list
stemmer=PorterStemmer()

pronouns = ["he", "him", "his", "himself", "she", "she's", "her", "hers", "herself", "they", "them", "their", "theirs", "themselves"]

for word in pronouns:
    if word in stop_words:
        stop_words.remove(word)
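The effect of keeping the pronouns can be sketched without NLTK. The example below uses a tiny made-up stop list as a stand-in for `stopwords.words('english')` and builds a new filtered list instead of removing items in place. All names and data here are illustrative.

```python
# Toy stand-in for stopwords.words('english'); the real NLTK list is much longer.
toy_stop_words = ["i", "the", "a", "she", "her", "they", "is", "and"]
pronouns_to_keep = {"he", "she", "her", "they", "them"}

# Build a new list rather than removing items in place, so the
# original list stays available for comparison.
filtered_stop_words = [w for w in toy_stop_words if w not in pronouns_to_keep]

tokens = ["she", "is", "tracking", "her", "cycle", "and", "they", "help"]
kept = [t for t in tokens if t not in filtered_stop_words]
print(kept)
```

The pronouns survive the filter while the other function words are dropped, which is the behavior the gendered-language analysis depends on.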

### Stem each word

In [75]:
# helper function
stem = lambda text: [stemmer.stem(word) for word in text if (word not in stop_words)]

# apply function
dfFullClean['text_clean'] = dfFullClean['text_clean'].apply(stem)
trainClean['text_clean'] = trainClean['text_clean'].apply(stem)
testClean['text_clean'] = testClean['text_clean'].apply(stem)

In [77]:
# trainClean.head()

### Join stemmed list together

Join list of stemmed words together to create a stemmed phrase

In [78]:
# Helper function
join = lambda lst: ' '.join(lst)

# apply helper function
dfFullClean['text_clean'] = dfFullClean['text_clean'].apply(join)
trainClean['text_clean'] = trainClean['text_clean'].apply(join)
testClean['text_clean'] = testClean['text_clean'].apply(join)
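Taken together, the regex steps in this notebook can be sketched as one function applied to a single string. This is a simplified illustration under stated assumptions, not the notebook's actual pipeline: it leaves out emoji removal, `TweetTokenizer`, stop-word filtering, and stemming, and `STEPS` and `clean_text` are hypothetical names.

```python
import re

# Ordered cleaning steps; each takes a string and returns a string.
STEPS = [
    str.lower,
    lambda t: re.sub(r"@[A-Za-z0-9_]+", "", t),   # mentions
    lambda t: re.sub(r"#[A-Za-z0-9_]+", "", t),   # hashtags
    lambda t: re.sub(r"http\S+", "", t),          # links
    lambda t: re.sub(r"[^a-z0-9]", " ", t),       # everything else -> space
]

def clean_text(text, steps=STEPS):
    """Apply each step in order, then collapse runs of whitespace."""
    for step in steps:
        text = step(text)
    return " ".join(text.split())

sample = "@friend My Period app crashed AGAIN!! https://t.co/xyz #fail"
print(clean_text(sample))
print(clean_text(sample, steps=STEPS[1:]))  # same steps without lowercasing
```

The second call shows why the order matters: the final pattern keeps only `a-z` and `0-9`, so any capital letter that has not been lowercased first is replaced by a space and the word is broken apart.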

In [5]:
# dfFullClean.head()
# trainClean.head()

<a name="csv"></a>
## Export to CSV

Create six files:

The first three will be used for my first Naive Bayes Model:
1. `fullTwitter_clean.csv`
2. `train_clean.csv`
3. `test_clean.csv`

The next three will be used for the new Naive Bayes model and the classification tree:          
4. `fullTwitter_clean_extras.csv`                               
5. `train_clean_extras.csv`                                    
6. `test_clean_extras.csv`

In [80]:
dfFullText = dfFullClean.loc[:,["text", "tweet_id", "text_clean"]]
dfFullText.to_csv('fullTwitter_clean.csv', index = False, header = True)

In [81]:
trainClean_text = trainClean.loc[:,["text", "label", "tweet_id", "text_clean"]]
trainClean_text.to_csv('train_clean.csv', index = False, header = True)

In [82]:
testClean_text = testClean.loc[:,["text", "label", "tweet_id", "text_clean"]]
testClean_text.to_csv('test_clean.csv', index = False, header = True)

___________________________________________________________________________________

In [83]:
dfFullText_extras = dfFullClean.loc[:,["text", "tweet_id", "text_clean", "date", "like_count"]]
dfFullText_extras.to_csv('fullTwitter_clean_extras.csv', index = False, header = True)

In [84]:
trainClean_text_extras = trainClean.loc[:,["text", "label", "tweet_id", "text_clean", "date", "like_count"]]
trainClean_text_extras.to_csv('train_clean_extras.csv', index = False, header = True)

In [85]:
testClean_text_extras = testClean.loc[:,["text", "label", "tweet_id", "text_clean", "date", "like_count"]]
testClean_text_extras.to_csv('test_clean_extras.csv', index = False, header = True)