# 2. Creating Training and Testing Datasets


**Author:** Tori Stiegman   
**Project:** Gender-Inclusive Language in Tweets about Menstruation   
**Date turned in:** Dec 19, 2022                                                                          
**Updated:** Feb 28, 2023

**About this notebook:** In this notebook, I use cluster sampling to create a training dataset and a test dataset. Both will be labeled and then used to train and test my classifier in future notebooks.

**Table of Contents**
1. [Load Data](#sec1)
2. [Create Filtered Dataset](#sec2)
3. [Cluster Sample to Create Full Training Dataset](#sec3)
4. [Remove Training rows from Full Dataset](#sec4)

In [46]:
import advertools as adv
import nltk
from nltk.corpus import stopwords
# nltk.download('stopwords')
from nltk.stem import PorterStemmer

import pandas as pd
# pd.set_option('display.max_colwidth', None)

# get rid of warnings pls
import warnings
warnings.filterwarnings('ignore')

<a name="sec1"></a>
## Load Data

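Load the data from `period_tweets.csv` into `dfFull` and bind a second name, `dfFullClean`, that we can clean throughout the notebook. Note that the plain assignment below does not copy the data: `dfFullClean` and `dfFull` refer to the same DataFrame, so changes made through one name are visible through the other.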

In [29]:
# None shows the full text of each column (-1 is deprecated in newer pandas versions)
pd.set_option('display.max_colwidth', None)
dfFull = pd.read_csv('period_tweets.csv')
dfFull = dfFull.drop('Unnamed: 0', axis = 1)

dfFullClean = dfFull

len(dfFull)

301155
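Because the assignment above binds a second name rather than copying, it can help to see the difference on a small example. The sketch below uses a made-up two-row frame (`toy`, `alias`, and `independent` are illustrative names, not part of this notebook) to contrast plain assignment with `.copy()`.

```python
import pandas as pd

# Toy stand-in for the tweet table (hypothetical data).
toy = pd.DataFrame({"text": ["first tweet", "second tweet"]})

alias = toy                # same object under a second name
independent = toy.copy()   # separate object with its own data

# Adding a column through the alias also changes `toy`,
# but leaves the independent copy untouched.
alias["flag"] = [1, 2]

alias_is_same_object = alias is toy
toy_has_flag = "flag" in toy.columns
copy_has_flag = "flag" in independent.columns
print(alias_is_same_object, toy_has_flag, copy_has_flag)  # True True False
```

If an independent duplicate is wanted, `.copy()` is the usual choice. The rest of this notebook works either way, as long as the same table is used consistently.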

Cast every value in the `text` column to a string, so that later string operations do not fail on missing or non-text values.

In [30]:
makeString = lambda text: str(text)

dfFullClean['text'] = dfFull['text'].apply(makeString)

In [1]:
# dfFullClean.head()

<a name="sec2"></a>
## Create Filtered Dataset

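Create a new dataframe, `dfFiltered`, containing only the tweets whose text includes at least one of the keywords below. Matching is by case-sensitive substring, so `trans` also matches words such as "transfer", and capitalized forms such as "Trans" are not matched.

Keywords:
- queer
- gender (will include misgender, gender neutral, genderqueer, agender, etc.)
- lgbt (will include lgbt, lgbtq, lgbtq+, lgbtqia, since matching is by substring)
- trans (including transgender, transmasculine, etc.)
- nonbinary
- dysphoria
- androgynous
- afab
- amab
- enby
- they (listed here, but not included in the filter in the cell below)
- them (listed here, but not included in the filter in the cell below)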

In [48]:
keywords = dfFull['text'].str.contains('queer', na=False) | \
           dfFull['text'].str.contains('gender', na=False) | \
           dfFull['text'].str.contains('lgbt', na=False) | \
           dfFull['text'].str.contains('trans', na=False) | \
           dfFull['text'].str.contains('nonbinary', na=False) | \
           dfFull['text'].str.contains('dysphoria', na=False) | \
           dfFull['text'].str.contains('androgynous', na=False) | \
           dfFull['text'].str.contains('afab', na=False) | \
           dfFull['text'].str.contains('amab', na=False) | \
           dfFull['text'].str.contains('enby', na=False)
        

dfFiltered = dfFullClean[keywords]
len(dfFiltered)
# dfFiltered.head()


10319
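The chained `str.contains` calls above can also be written as a single regular-expression alternation. The following is a minimal sketch on made-up tweets (the `toy` frame and its contents are hypothetical), showing the same case-sensitive substring behaviour next to a case-insensitive variant. Whether case-insensitive matching is desirable for this project is an assumption, not something the notebook states.

```python
import re
import pandas as pd

# Hypothetical example tweets, not drawn from period_tweets.csv.
toy = pd.DataFrame({"text": [
    "periods and gender dysphoria",
    "Trans men get periods too",
    "my period started today",
    "bank transfer pending",
    None,
]})

keywords = ["queer", "gender", "lgbt", "trans", "nonbinary", "dysphoria",
            "androgynous", "afab", "amab", "enby"]

# One alternation pattern instead of ten chained str.contains calls.
pattern = "|".join(re.escape(k) for k in keywords)

# Case-sensitive substring match, as in the cell above.
mask = toy["text"].str.contains(pattern, na=False)

# Case-insensitive variant (an assumption about what might be wanted).
mask_ci = toy["text"].str.contains(pattern, case=False, na=False)

print(mask.tolist())     # [True, False, False, True, False]
print(mask_ci.tolist())  # [True, True, False, True, False]
```

The fourth row shows the cost of substring matching: "transfer" is kept because it contains `trans`. The second row is only kept by the case-insensitive variant.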

<a name="sec3"></a>
## Cluster Sample to Create Full Training Dataset

I will use cluster sampling to generate a random sample of tweets: half are drawn from the keyword-filtered dataset and half from the full dataset. These tweets will then be labeled as gender-inclusive, gender-neutral, or gender-exclusive, and the labeled dataset will be used to train a model.

### Filtered Random Sample
Create a random sample from the filtered dataset (n = 140).

In [2]:
import random

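# pandas samples with NumPy's random state, which random.seed() does not control,
# so the seed is passed to sample() directly to make the draw reproducible.
filtered_sample_df = dfFiltered.sample(n = 140, random_state = 1234)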
# filtered_sample_df.head()
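For a reproducible draw, the seed has to reach pandas itself. The sketch below, on a hypothetical 100-row frame (`toy` and `tweet_id` are made-up names), shows that passing `random_state` to `sample` gives the same rows on every call, whereas seeding Python's built-in `random` module has no effect on pandas.

```python
import random
import pandas as pd

toy = pd.DataFrame({"tweet_id": range(100)})  # hypothetical data

# Same random_state -> same rows in the same order, every time.
draw_a = toy.sample(n=5, random_state=1234)
draw_b = toy.sample(n=5, random_state=1234)
same_with_random_state = draw_a.index.equals(draw_b.index)
print(same_with_random_state)  # True

# random.seed() only seeds Python's built-in generator; pandas does not read it,
# so these two draws will usually differ even though the seed is identical.
random.seed(1234)
draw_c = toy.sample(n=5)
random.seed(1234)
draw_d = toy.sample(n=5)
```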

### Full Random Sample
Create a random sample from the full dataset (n = 140)

In [34]:
# Pass the seed as random_state; random.seed() does not affect pandas' sampling.
full_sample_df = dfFull.sample(n = 140, random_state = 12345)

### Combining into Training Dataset
Combine the full random sample and the filtered random sample to create the whole random sample (n = 280).

In [3]:
# Stack the two samples and renumber the rows from 0
training_data_df = pd.concat([full_sample_df, filtered_sample_df]).reset_index(drop = True)
training_data_df.head()
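Since the filtered tweets are a subset of the full dataset, the same tweet could in principle be drawn in both samples. A minimal sketch of how one might check for this before renumbering the rows, using made-up samples whose index labels stand in for row positions in the full dataset (all names and values here are hypothetical):

```python
import pandas as pd

# Hypothetical samples; index labels stand in for rows of the full dataset.
full_sample = pd.DataFrame({"text": ["a", "b", "c"]}, index=[10, 11, 12])
filtered_sample = pd.DataFrame({"text": ["c", "d"]}, index=[12, 13])

combined = pd.concat([full_sample, filtered_sample])

# Rows whose original index label appears more than once were drawn twice.
drawn_twice = combined.index[combined.index.duplicated()].tolist()
print(drawn_twice)  # [12]

# Keep one copy of each row, then renumber.
deduplicated = combined[~combined.index.duplicated()].reset_index(drop=True)
print(len(deduplicated))  # 4
```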

### Export as CSV for Labeling

In [36]:
training_data_df.to_csv('training_unlabeled.csv', index = False, header = True)

<a name="sec4"></a>
## Remove Training rows from Full Dataset

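Here I will remove the rows that were included in the training dataset from the full dataset. Concatenating the two tables and calling `drop_duplicates(keep=False)` drops every row that occurs more than once, so any tweets that were already exact duplicates within the full dataset are removed as well. This may be why the shapes below differ by 281 rows rather than 280.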


In [37]:
dfRemoved1 = pd.concat([dfFullClean, training_data_df]).drop_duplicates(keep=False)

In [38]:
dfRemoved1.shape

(300874, 7)

In [39]:
dfFullClean.shape

(301155, 7)
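To illustrate the side effect of `drop_duplicates(keep=False)`, here is a small sketch on made-up data that compares it with an index-based removal. The index-based version assumes the sample still carries its original index labels (that is, it is taken before `reset_index`); the names `full`, `sample`, `via_duplicates`, and `via_index` are hypothetical.

```python
import pandas as pd

# Hypothetical data: rows 0 and 1 are identical tweets that occur twice in the source.
full = pd.DataFrame({"text": ["same", "same", "x", "y", "z"]})
sample = full.loc[[2, 4]]  # sampled rows keep their original index labels

# Approach used above: concat + drop_duplicates(keep=False).
# This also removes the two pre-existing "same" rows.
via_duplicates = pd.concat([full, sample]).drop_duplicates(keep=False)

# Index-based alternative: drop exactly the sampled labels and nothing else.
via_index = full.drop(index=sample.index)

print(len(via_duplicates), len(via_index))  # 1 3
```

Which behaviour is preferable depends on whether exact duplicate tweets should remain in the unlabeled pool.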

## Create Test Dataset

### Filter the Data Again

Create a new dataframe, `dfFiltered1`, by applying the same keyword filter as in [Create Filtered Dataset](#sec2) to `dfRemoved1`, the full dataset with the training rows removed.

In [5]:
keywords = dfRemoved1['text'].str.contains('queer', na=False) | \
           dfRemoved1['text'].str.contains('gender', na=False) | \
           dfRemoved1['text'].str.contains('lgbt', na=False) | \
           dfRemoved1['text'].str.contains('trans', na=False) | \
           dfRemoved1['text'].str.contains('nonbinary', na=False) | \
           dfRemoved1['text'].str.contains('dysphoria', na=False) | \
           dfRemoved1['text'].str.contains('androgynous', na=False) | \
           dfRemoved1['text'].str.contains('afab', na=False) | \
           dfRemoved1['text'].str.contains('amab', na=False) | \
           dfRemoved1['text'].str.contains('enby', na=False)
        

dfFiltered1 = dfRemoved1[keywords]
len(dfFiltered1)
dfFiltered1.head()

### Cluster Sample to Create Full Test Dataset

I will use cluster sampling to generate a random sample of tweets: half are drawn from the keyword-filtered tweets and half from the full dataset. These tweets will then be labeled as gender-inclusive, gender-neutral, or gender-exclusive, and the labeled dataset will be used to test a model.

#### Filtered Random Sample
Create a random sample from the filtered dataset (n = 28).

In [6]:
# Pass the seed as random_state; random.seed() does not affect pandas' sampling.
Test_filtered_sample_df = dfFiltered1.sample(n = 28, random_state = 1234)
Test_filtered_sample_df.head()

#### Full Random Sample
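Create a random sample from the full dataset with the training rows removed (n = 28).

In [25]:
# Sample from dfRemoved1 rather than dfFull so that no training tweet can end up in the test set.
# The seed is passed as random_state because random.seed() does not affect pandas' sampling.
Test_full_sample_df = dfRemoved1.sample(n = 28, random_state = 12345)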

#### Combining into Test Dataset
Combine the full random sample and the filtered random sample to create the whole random sample (n = 56).

In [27]:
# Stack the two samples and renumber the rows from 0
test_data_df = pd.concat([Test_full_sample_df, Test_filtered_sample_df]).reset_index(drop = True)
test_data_df.head()
test_data_df.shape

(56, 7)

#### Export test dataset for labeling

In [41]:
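# index = False matches the training export and avoids writing an extra unnamed index column
test_data_df.to_csv('test_unlabeled.csv', index = False, header = True)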

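### Create `noTrainTest_fullTweets.csv`

Remove the test rows from `dfRemoved1` in the same way as the training rows, then export the remaining tweets.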

In [40]:
dfRemoved2 = pd.concat([dfRemoved1, test_data_df]).drop_duplicates(keep=False)

In [42]:
dfRemoved2.shape

(300818, 7)
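Before exporting, it may be worth confirming that the training set, the test set, and the remaining tweets do not share any rows. The sketch below shows one way to do this on made-up tables. It assumes each tweet has a unique identifier column, here called `tweet_id`. That column name and the helper `shared_ids` are hypothetical and would need to be adapted to the actual columns in `period_tweets.csv`.

```python
import pandas as pd

# Hypothetical stand-ins for the three tables; `tweet_id` is an assumed key column.
train = pd.DataFrame({"tweet_id": [1, 2, 3]})
test = pd.DataFrame({"tweet_id": [4, 5]})
remaining = pd.DataFrame({"tweet_id": [6, 7, 8, 9]})

def shared_ids(left, right, key="tweet_id"):
    """Return the sorted key values that occur in both frames."""
    return sorted(set(left[key]) & set(right[key]))

train_test_overlap = shared_ids(train, test)
train_rest_overlap = shared_ids(train, remaining)
test_rest_overlap = shared_ids(test, remaining)

# All three lists should be empty if the splits are disjoint.
print(train_test_overlap, train_rest_overlap, test_rest_overlap)  # [] [] []
```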

In [44]:
dfRemoved2.to_csv('noTrainTest_fullTweets.csv', index = False, header = True)