# Text-Based Ad Feedback Topic Modeling: Capstone Project<br>
<i>Developed by Rebeca Mahr -- Spring 2021</i>

* <strong>Objective</strong>: Create a model to identify topics among text-based video ad feedback from online campaign evaluation surveys.
<br>
<br>
* <strong>Background</strong>: Annual evaluation surveys were distributed among an advertising brand’s target audience. As part of the surveys, participants were shown three video ads that were flighted prior to the campaign evaluation. For each video ad shown, participants were asked to explain what they thought its main message was. This data is used to assess how well campaign messaging is understood and to inform creative updates for future message package flighting.
<br>
<br>
* <strong>Business utility</strong>: Qualitative feedback/textual responses from an advertising campaign’s target audience can be invaluable in determining ad and message comprehension as well as in making decisions for future rounds of campaign flighting. However, human analysis of this data requires reading through thousands of responses and manually coding them into categories, which is extremely time-consuming and subject to the reviewer's own judgment. Utilizing a machine learning algorithm to identify topics based on the textual responses can cut the number of manual labor hours spent in analysis and minimize some bias due to subjectivity.
<br>
<br>
* <strong>Constraints</strong>: Data is based on survey participants who opt in to participating in research, so there is a level of sample selection bias involved in the algorithm results. Additionally, defining the model’s topic clusters still involves human interpretation, so this model requires some domain knowledge on the part of the reviewer and cannot completely eliminate reviewer bias.
<br>
<br>
* <strong>Notes on Data</strong>:
    * This data was collected as part of surveys evaluating a nicotine vape prevention brand across two US states. The same three video ads were tested in each survey.
    * The data includes two data files:
        * Ad_Feedback_Text (response level - stacked): 
            * ID: Unique identifier for participant
            * Text: Includes all text responses
            * Ad: Indicates the specific video ad the response was for
        * Ad_Feedback_Demos (respondent level - flat): 
            * ID: Unique identifier for participant
            * CalculatedAge: Participant age calculated from birthdate
            * Race: Participant self-reported race
            * Gender: Participant self-reported gender
            * Segment: Audience segment the participant belongs to
            * Region: State where survey was distributed (encoded for client privacy)
            * Urban_Rural: Whether the participant reported living in an urban or rural area
    * Data sourced from Rescue Agency Public Benefit LLC ad tracking online surveys.
    * Data has been de-identified to protect participant privacy.
    * Data not publicly available, in compliance with IRB protocol requirements.
    * See b_Data_Sample/Sample_Dataset_TextBased_AdFeedback_TopicModeling.xlsx to view data format
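To make the two layouts concrete, the sketch below builds toy versions of both files. The IDs, responses, and demographic values are invented for illustration and are not drawn from the real data; only the column names come from the list above.

```python
import pandas as pd

# Stacked (response-level) file: one row per participant per ad shown.
# All values below are made up for illustration.
toy_text = pd.DataFrame({
    "ID": ["P1", "P1", "P2"],
    "Text": ["vaping harms your lungs", "ads can mislead you", "idk"],
    "Ad": ["DF", "ST", "DF"],
})

# Flat (respondent-level) file: exactly one row per participant.
toy_demos = pd.DataFrame({
    "ID": ["P1", "P2"],
    "CalculatedAge": [17, 15],
    "Race": ["White Only", "Black Only"],
    "Gender": ["Female", "Male"],
    "Segment": [5, 1],
    "Region": [1, 2],
    "Urban_Rural": ["Urban", "Rural"],
})

# IDs may repeat in the stacked file but should be unique in the flat file.
responses_per_person = toy_text.groupby("ID").size()
ids_unique_in_demos = toy_demos["ID"].is_unique
```

Because the text file is stacked, any join to the demographics file is many-to-one on `ID`.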

# Data Wrangling

## Table of Contents
* [A: Load & Inspect Files](#a)
* [B: Merge Files](#b)
* [C: Initial Cleaning of Text](#c)

## A: Load & Inspect Files <a class="anchor" id="a"></a>

In [1]:
#Import libraries
import pandas as pd
import numpy as np
import string
import contractions

In [2]:
#Open files
feedback_df = pd.read_csv('../Datafiles/AD_FEEDBACK_TEXT.csv')
demos_df = pd.read_csv('../Datafiles/AD_FEEDBACK_DEMOS.csv')

### Feedback File Review

In [3]:
#Inspect raw feedback file
print(feedback_df.info())
feedback_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1448 entries, 0 to 1447
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   ID        1448 non-null   object
 1   Text      1448 non-null   object
 2   Ad        1448 non-null   object
 3   Question  1448 non-null   object
dtypes: object(4)
memory usage: 45.4+ KB
None


Unnamed: 0,ID,Text,Ad,Question
0,R_31EnmC82PmXgfeJ,A lot of chemicals are in vapes that damage th...,DF,Main Message
1,R_2UVKzhgLoqvkyzk,A method of discouragement against vaping by i...,DD,Main Message
2,R_2PC3MSRmAF3ln2c,About the ingredients in vape,ST,Main Message
3,R_2X1neYEGvSteyEY,about the stuff that's in the vape juice,ST,Main Message
4,R_2rMIp9IjjI6Mzpi,Ads at vape stores are misleading. It makes yo...,ST,Main Message


In [4]:
#Check for duplicates
feedback_df.duplicated().value_counts()

False    1448
dtype: int64

#### Notes:
* 1,448 text responses
* No null values, although there may be some nonsense responses in text column


In [5]:
#Inspect Categories of Ad and Question Columns
print('\033[1m' + 'Ad Column Values'+'\033[0m')
print(feedback_df['Ad'].value_counts())
print('\033[1m' + 'Question Values'+'\033[0m')
print(feedback_df['Question'].value_counts())

Ad Column Values
DF    724
ST    373
DD    351
Name: Ad, dtype: int64
Question Values
Main Message    1448
Name: Question, dtype: int64


#### Notes:
* Three Ads: DF, ST, and DD
* Question: Main Message is the only value, so this column can be dropped
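A column with a single distinct value carries no information for modeling. As a small sketch on invented data (not the survey file), such columns can be found with `nunique` before dropping them:

```python
import pandas as pd

# Toy frame; the values are illustrative only.
toy = pd.DataFrame({
    "Text": ["a", "b", "c"],
    "Ad": ["DF", "ST", "DD"],
    "Question": ["Main Message"] * 3,
})

# Columns whose number of distinct values (counting NaN) is exactly one.
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) == 1]
toy = toy.drop(columns=constant_cols)
```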

In [6]:
#Drop Question column
feedback_df = feedback_df.drop(columns='Question')

#Reset index
feedback_df = feedback_df.reset_index(drop=True)

### Demos File Review

In [7]:
#Inspect raw demos file
print(demos_df.info())
demos_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 724 entries, 0 to 723
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   ID             724 non-null    object
 1   CalculatedAge  724 non-null    int64 
 2   Race           724 non-null    object
 3   Gender         724 non-null    object
 4   Segment        724 non-null    int64 
 5   Region         724 non-null    int64 
 6   Urban_Rural    724 non-null    object
dtypes: int64(3), object(4)
memory usage: 39.7+ KB
None


Unnamed: 0,ID,CalculatedAge,Race,Gender,Segment,Region,Urban_Rural
0,R_1pA46rQimTRsAXQ,17,White Only,Male,5,1,Urban
1,R_1HdhEhdQGzgDkre,17,White Only,Male,5,1,Urban
2,R_Rab5MQj0TqGqV4R,18,White Only,Female,4,1,Rural
3,R_2uVXBpLsewn1Snv,16,White Only,Female,5,1,Urban
4,R_2ymgqdHCNQnDOru,18,White Only,Female,4,1,Rural


In [8]:
#Check for duplicates
demos_df.duplicated().value_counts()

False    724
dtype: int64

#### Note: 
* Sample is 724 survey participants
* No missing or duplicate data

In [9]:
#Rename age variable for clarity

rename_dict = {
    'CalculatedAge':'Age'
}

demos_df = demos_df.rename(columns = rename_dict)
demos_df.head()

Unnamed: 0,ID,Age,Race,Gender,Segment,Region,Urban_Rural
0,R_1pA46rQimTRsAXQ,17,White Only,Male,5,1,Urban
1,R_1HdhEhdQGzgDkre,17,White Only,Male,5,1,Urban
2,R_Rab5MQj0TqGqV4R,18,White Only,Female,4,1,Rural
3,R_2uVXBpLsewn1Snv,16,White Only,Female,5,1,Urban
4,R_2ymgqdHCNQnDOru,18,White Only,Female,4,1,Rural


In [10]:
#Look at range of values for each demographic variable
print('\033[1m' + 'Age Values'+'\033[0m')
print(demos_df['Age'].value_counts())
print('\033[1m' + 'Race Values'+'\033[0m')
print(demos_df['Race'].value_counts())
print('\033[1m' + 'Gender Values'+'\033[0m')
print(demos_df['Gender'].value_counts())
print('\033[1m' + 'Segment Values'+'\033[0m')
print(demos_df['Segment'].value_counts())
print('\033[1m' + 'Region Values'+'\033[0m')
print(demos_df['Region'].value_counts())
print('\033[1m' + 'Urban vs. Rural Values'+'\033[0m')
print(demos_df['Urban_Rural'].value_counts())

Age Values
18    235
17    174
16    142
15     90
14     45
19     20
13     18
Name: Age, dtype: int64
Race Values
White Only                    523
Black Only                     65
Hispanic/Latino ANY            60
Two or More - Non Hispanic     52
Asian Only                     17
Other                           7
Name: Race, dtype: int64
Gender Values
Female    462
Male      262
Name: Gender, dtype: int64
Segment Values
5    197
1    186
4    140
2     83
6     62
3     56
Name: Segment, dtype: int64
Region Values
2    390
1    334
Name: Region, dtype: int64
Urban vs. Rural Values
Urban    510
Rural    214
Name: Urban_Rural, dtype: int64


#### Notes: 
* Demographic variables help interpret topics across audience groups but are secondary to this analysis
* Will just encode the dichotomous variables for now
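One compact way to encode a two-level variable is to compare against the level of interest and cast to integer, which avoids creating and then dropping the complementary column. This is a sketch on toy data, offered as an alternative to `pd.get_dummies` rather than as the approach taken in this notebook.

```python
import pandas as pd

# Toy frame; the values are illustrative only.
toy = pd.DataFrame({
    "Urban_Rural": ["Urban", "Rural", "Urban"],
    "Gender": ["Male", "Female", "Female"],
    "Region": [1, 2, 1],
})

# Each indicator is 1 when the row matches the named level, else 0.
toy["Urban"] = toy["Urban_Rural"].eq("Urban").astype(int)
toy["Female"] = toy["Gender"].eq("Female").astype(int)
toy["Region_1"] = toy["Region"].eq(1).astype(int)

encoded = toy.drop(columns=["Urban_Rural", "Gender", "Region"])
```

Naming the reference level explicitly also removes any dependence on the column order that `pd.get_dummies` returns.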

In [11]:
#One-hot encode Urban_Rural using pd.get_dummies (columns come back in sorted order: Rural, Urban)
demos_df[['Rural','Urban']] = pd.get_dummies(demos_df['Urban_Rural'])
#Check work
demos_df.head()

Unnamed: 0,ID,Age,Race,Gender,Segment,Region,Urban_Rural,Rural,Urban
0,R_1pA46rQimTRsAXQ,17,White Only,Male,5,1,Urban,0,1
1,R_1HdhEhdQGzgDkre,17,White Only,Male,5,1,Urban,0,1
2,R_Rab5MQj0TqGqV4R,18,White Only,Female,4,1,Rural,1,0
3,R_2uVXBpLsewn1Snv,16,White Only,Female,5,1,Urban,0,1
4,R_2ymgqdHCNQnDOru,18,White Only,Female,4,1,Rural,1,0


In [12]:
#Drop 'Urban_Rural' & 'Rural' to just keep 'Urban'

demos_df = demos_df.drop(columns=['Urban_Rural','Rural'])
demos_df.head()

Unnamed: 0,ID,Age,Race,Gender,Segment,Region,Urban
0,R_1pA46rQimTRsAXQ,17,White Only,Male,5,1,1
1,R_1HdhEhdQGzgDkre,17,White Only,Male,5,1,1
2,R_Rab5MQj0TqGqV4R,18,White Only,Female,4,1,0
3,R_2uVXBpLsewn1Snv,16,White Only,Female,5,1,1
4,R_2ymgqdHCNQnDOru,18,White Only,Female,4,1,0


In [13]:
#One-hot encode Gender using pd.get_dummies (columns come back in sorted order: Female, Male)
demos_df[['Female','Male']] = pd.get_dummies(demos_df['Gender'])
#Check work
demos_df.head()

Unnamed: 0,ID,Age,Race,Gender,Segment,Region,Urban,Female,Male
0,R_1pA46rQimTRsAXQ,17,White Only,Male,5,1,1,0,1
1,R_1HdhEhdQGzgDkre,17,White Only,Male,5,1,1,0,1
2,R_Rab5MQj0TqGqV4R,18,White Only,Female,4,1,0,1,0
3,R_2uVXBpLsewn1Snv,16,White Only,Female,5,1,1,1,0
4,R_2ymgqdHCNQnDOru,18,White Only,Female,4,1,0,1,0


In [14]:
#Drop 'Gender' & 'Male' to just keep 'Female'

demos_df = demos_df.drop(columns=['Gender','Male'])
demos_df.head()

Unnamed: 0,ID,Age,Race,Segment,Region,Urban,Female
0,R_1pA46rQimTRsAXQ,17,White Only,5,1,1,0
1,R_1HdhEhdQGzgDkre,17,White Only,5,1,1,0
2,R_Rab5MQj0TqGqV4R,18,White Only,4,1,0,1
3,R_2uVXBpLsewn1Snv,16,White Only,5,1,1,1
4,R_2ymgqdHCNQnDOru,18,White Only,4,1,0,1


In [15]:
#One-hot encode Region using pd.get_dummies (columns come back in sorted order: 1, 2)
demos_df[['Region_1','Region_2']] = pd.get_dummies(demos_df['Region'])
#Check work
demos_df.head()

Unnamed: 0,ID,Age,Race,Segment,Region,Urban,Female,Region_1,Region_2
0,R_1pA46rQimTRsAXQ,17,White Only,5,1,1,0,1,0
1,R_1HdhEhdQGzgDkre,17,White Only,5,1,1,0,1,0
2,R_Rab5MQj0TqGqV4R,18,White Only,4,1,0,1,1,0
3,R_2uVXBpLsewn1Snv,16,White Only,5,1,1,1,1,0
4,R_2ymgqdHCNQnDOru,18,White Only,4,1,0,1,1,0


In [16]:
#Drop 'Region' & 'Region_2' to just keep 'Region_1'

demos_df = demos_df.drop(columns=['Region','Region_2'])
demos_df.head()

Unnamed: 0,ID,Age,Race,Segment,Urban,Female,Region_1
0,R_1pA46rQimTRsAXQ,17,White Only,5,1,0,1
1,R_1HdhEhdQGzgDkre,17,White Only,5,1,0,1
2,R_Rab5MQj0TqGqV4R,18,White Only,4,0,1,1
3,R_2uVXBpLsewn1Snv,16,White Only,5,1,1,1
4,R_2ymgqdHCNQnDOru,18,White Only,4,0,1,1


In [17]:
#Save processed demos_df
demos_df.to_csv('../Datafiles/AD_FEEDBACK_DEMOS_Processed.csv', index= False)

## B: Merge Files <a class="anchor" id="b"></a>

In [18]:
#Add demo data to feedback file to create full file for analysis
full_df = pd.merge(feedback_df, demos_df, on='ID', how='left')
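
A left merge silently duplicates rows if the right-hand key is not unique and leaves NaN where no match exists. The sketch below, on toy frames with invented values, uses the `validate` and `indicator` arguments of `pd.merge` to guard against both.

```python
import pandas as pd

# Toy frames; the values are illustrative only.
toy_feedback = pd.DataFrame({"ID": ["P1", "P1", "P2"], "Ad": ["DF", "ST", "DF"]})
toy_demos = pd.DataFrame({"ID": ["P1", "P2"], "Age": [17, 15]})

# validate raises MergeError if an ID appears more than once in toy_demos;
# indicator adds a _merge column recording where each row was found.
merged = pd.merge(
    toy_feedback, toy_demos,
    on="ID", how="left",
    validate="many_to_one", indicator=True,
)

# Count responses that found no demographic record, then drop the helper column.
unmatched = int((merged["_merge"] == "left_only").sum())
merged = merged.drop(columns="_merge")
```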

In [19]:
#Inspect merged df
full_df.head()

Unnamed: 0,ID,Text,Ad,Age,Race,Segment,Urban,Female,Region_1
0,R_31EnmC82PmXgfeJ,A lot of chemicals are in vapes that damage th...,DF,17,White Only,1,1,0,1
1,R_2UVKzhgLoqvkyzk,A method of discouragement against vaping by i...,DD,18,White Only,4,0,0,0
2,R_2PC3MSRmAF3ln2c,About the ingredients in vape,ST,18,White Only,1,1,1,0
3,R_2X1neYEGvSteyEY,about the stuff that's in the vape juice,ST,17,White Only,5,1,1,0
4,R_2rMIp9IjjI6Mzpi,Ads at vape stores are misleading. It makes yo...,ST,18,Two or More - Non Hispanic,4,1,1,0


In [20]:
full_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1448 entries, 0 to 1447
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   ID        1448 non-null   object
 1   Text      1448 non-null   object
 2   Ad        1448 non-null   object
 3   Age       1448 non-null   int64 
 4   Race      1448 non-null   object
 5   Segment   1448 non-null   int64 
 6   Urban     1448 non-null   uint8 
 7   Female    1448 non-null   uint8 
 8   Region_1  1448 non-null   uint8 
dtypes: int64(2), object(4), uint8(3)
memory usage: 83.4+ KB


In [21]:
#Check for duplicate rows
full_df.duplicated().value_counts()

False    1448
dtype: int64

## C: Initial Cleaning of Text <a class="anchor" id="c"></a>

In [22]:
#Rename Text column to Text_Original

clean_df = full_df.rename(columns = {'Text':'Text_Original'})
clean_df.head()

Unnamed: 0,ID,Text_Original,Ad,Age,Race,Segment,Urban,Female,Region_1
0,R_31EnmC82PmXgfeJ,A lot of chemicals are in vapes that damage th...,DF,17,White Only,1,1,0,1
1,R_2UVKzhgLoqvkyzk,A method of discouragement against vaping by i...,DD,18,White Only,4,0,0,0
2,R_2PC3MSRmAF3ln2c,About the ingredients in vape,ST,18,White Only,1,1,1,0
3,R_2X1neYEGvSteyEY,about the stuff that's in the vape juice,ST,17,White Only,5,1,1,0
4,R_2rMIp9IjjI6Mzpi,Ads at vape stores are misleading. It makes yo...,ST,18,Two or More - Non Hispanic,4,1,1,0


#### Expanding Contractions

Source: Code adapted from Towards Data Science article posted by Kamil Mysiak at https://towardsdatascience.com/preprocessing-text-data-using-python-576206753c28 retrieved in January, 2021.

In [23]:
#Create a column holding each response as a list of words with contractions expanded
clean_df['no_contract'] = clean_df['Text_Original'].apply(lambda x: [contractions.fix(word) for word in x.split()])
clean_df.head()

Unnamed: 0,ID,Text_Original,Ad,Age,Race,Segment,Urban,Female,Region_1,no_contract
0,R_31EnmC82PmXgfeJ,A lot of chemicals are in vapes that damage th...,DF,17,White Only,1,1,0,1,"[A, lot, of, chemicals, are, in, vapes, that, ..."
1,R_2UVKzhgLoqvkyzk,A method of discouragement against vaping by i...,DD,18,White Only,4,0,0,0,"[A, method, of, discouragement, against, vapin..."
2,R_2PC3MSRmAF3ln2c,About the ingredients in vape,ST,18,White Only,1,1,1,0,"[About, the, ingredients, in, vape]"
3,R_2X1neYEGvSteyEY,about the stuff that's in the vape juice,ST,17,White Only,5,1,1,0,"[about, the, stuff, that is, in, the, vape, ju..."
4,R_2rMIp9IjjI6Mzpi,Ads at vape stores are misleading. It makes yo...,ST,18,Two or More - Non Hispanic,4,1,1,0,"[Ads, at, vape, stores, are, misleading., It, ..."


In [24]:
#Joins list column 'no_contract' back into a single string
clean_df['Text_Clean'] = [' '.join(map(str,l)) for l in clean_df['no_contract']]
clean_df.head()

Unnamed: 0,ID,Text_Original,Ad,Age,Race,Segment,Urban,Female,Region_1,no_contract,Text_Clean
0,R_31EnmC82PmXgfeJ,A lot of chemicals are in vapes that damage th...,DF,17,White Only,1,1,0,1,"[A, lot, of, chemicals, are, in, vapes, that, ...",A lot of chemicals are in vapes that damage th...
1,R_2UVKzhgLoqvkyzk,A method of discouragement against vaping by i...,DD,18,White Only,4,0,0,0,"[A, method, of, discouragement, against, vapin...",A method of discouragement against vaping by i...
2,R_2PC3MSRmAF3ln2c,About the ingredients in vape,ST,18,White Only,1,1,1,0,"[About, the, ingredients, in, vape]",About the ingredients in vape
3,R_2X1neYEGvSteyEY,about the stuff that's in the vape juice,ST,17,White Only,5,1,1,0,"[about, the, stuff, that is, in, the, vape, ju...",about the stuff that is in the vape juice
4,R_2rMIp9IjjI6Mzpi,Ads at vape stores are misleading. It makes yo...,ST,18,Two or More - Non Hispanic,4,1,1,0,"[Ads, at, vape, stores, are, misleading., It, ...",Ads at vape stores are misleading. It makes yo...
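
The two cells above expand contractions word by word and then rejoin the words into one string. The sketch below mimics that flow with a tiny hand-written lookup table standing in for the `contractions` package; the `TOY_CONTRACTIONS` mapping and `expand` helper are hypothetical and cover only a few forms.

```python
# Hypothetical stand-in for the contractions package: a tiny lookup table.
TOY_CONTRACTIONS = {
    "that's": "that is",
    "shouldn't": "should not",
    "don't": "do not",
}

def expand(text: str) -> str:
    """Expand known contractions word by word, then rejoin with spaces."""
    # Normalize curly apostrophes so "shouldn’t" matches the table.
    words = text.replace("’", "'").split()
    expanded = [TOY_CONTRACTIONS.get(word.lower(), word) for word in words]
    return " ".join(expanded)

result = expand("about the stuff that's in the vape juice")
```

Splitting on whitespace and rejoining with single spaces also normalizes any runs of spaces or line breaks in the original response.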


In [25]:
#Delete no_contract column
clean_df.drop('no_contract', axis=1, inplace=True)

In [26]:
#Check nulls again
clean_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1448 entries, 0 to 1447
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   ID             1448 non-null   object
 1   Text_Original  1448 non-null   object
 2   Ad             1448 non-null   object
 3   Age            1448 non-null   int64 
 4   Race           1448 non-null   object
 5   Segment        1448 non-null   int64 
 6   Urban          1448 non-null   uint8 
 7   Female         1448 non-null   uint8 
 8   Region_1       1448 non-null   uint8 
 9   Text_Clean     1448 non-null   object
dtypes: int64(2), object(5), uint8(3)
memory usage: 94.7+ KB


#### Remove all non-letter/number characters

In [27]:
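#Replace every character that is not a letter, digit, or space with a space
#(regex=True is required in pandas 2.0+, where str.replace defaults to literal matching)
clean_df['Text_Clean'] = clean_df['Text_Clean'].str.replace('[^a-zA-Z 0-9]', ' ', regex=True)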
clean_df.sort_values(by='Text_Clean')

Unnamed: 0,ID,Text_Original,Ad,Age,Race,Segment,Urban,Female,Region_1,Text_Clean
0,R_31EnmC82PmXgfeJ,A lot of chemicals are in vapes that damage th...,DF,17,White Only,1,1,0,1,A lot of chemicals are in vapes that damage th...
1,R_2UVKzhgLoqvkyzk,A method of discouragement against vaping by i...,DD,18,White Only,4,0,0,0,A method of discouragement against vaping by i...
2,R_2PC3MSRmAF3ln2c,About the ingredients in vape,ST,18,White Only,1,1,1,0,About the ingredients in vape
4,R_2rMIp9IjjI6Mzpi,Ads at vape stores are misleading. It makes yo...,ST,18,Two or More - Non Hispanic,4,1,1,0,Ads at vape stores are misleading It makes yo...
5,R_UnnyfHhm6BP8r4Z,"Again, to scare teens from vaping.",DF,17,White Only,1,0,1,0,Again to scare teens from vaping
...,...,...,...,...,...,...,...,...,...,...
1429,R_2c1tgvajRrxXVHQ,you are more susceptible to viruses,DF,18,Hispanic/Latino ANY,1,0,0,0,you are more susceptible to viruses
1447,R_1fjiNCL1NCK36nM,youre vulnerable if you vape,DF,16,White Only,1,1,1,0,you are vulnerable if you vape
1440,R_1jvsFSGMemQmogz,you shouldn’t vape because of how damaging it is,DF,15,Black Only,6,1,1,0,you should not vape because of how damaging it is
1442,R_yD55zMWP70MfIJj,young adults should stop vaping and should onl...,ST,18,White Only,1,1,1,1,young adults should stop vaping and should onl...
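
The character class `[^a-zA-Z 0-9]` matches anything that is not an ASCII letter, digit, or space, so accented letters are removed along with punctuation. A minimal sketch with the standard `re` module on an example string shows its effect, including the fact that each removed character leaves a space behind:

```python
import re

NON_ALNUM = re.compile(r"[^a-zA-Z 0-9]")

def strip_symbols(text: str) -> str:
    """Replace every character outside ASCII letters, digits, and spaces with a space."""
    return NON_ALNUM.sub(" ", text)

cleaned = strip_symbols("Again, to scare teens from vaping.")
# Runs of spaces can be collapsed afterwards if single spacing matters.
collapsed = " ".join(cleaned.split())
```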


In [28]:
#Filter to cases with text (drop rows that are only whitespace); copy to avoid chained-assignment warnings later
clean_df = clean_df[~clean_df['Text_Clean'].str.isspace()].copy()
print(clean_df.info())
clean_df.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1448 entries, 0 to 1447
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   ID             1448 non-null   object
 1   Text_Original  1448 non-null   object
 2   Ad             1448 non-null   object
 3   Age            1448 non-null   int64 
 4   Race           1448 non-null   object
 5   Segment        1448 non-null   int64 
 6   Urban          1448 non-null   uint8 
 7   Female         1448 non-null   uint8 
 8   Region_1       1448 non-null   uint8 
 9   Text_Clean     1448 non-null   object
dtypes: int64(2), object(5), uint8(3)
memory usage: 94.7+ KB
None


Unnamed: 0,ID,Text_Original,Ad,Age,Race,Segment,Urban,Female,Region_1,Text_Clean
0,R_31EnmC82PmXgfeJ,A lot of chemicals are in vapes that damage th...,DF,17,White Only,1,1,0,1,A lot of chemicals are in vapes that damage th...
1,R_2UVKzhgLoqvkyzk,A method of discouragement against vaping by i...,DD,18,White Only,4,0,0,0,A method of discouragement against vaping by i...
2,R_2PC3MSRmAF3ln2c,About the ingredients in vape,ST,18,White Only,1,1,1,0,About the ingredients in vape
3,R_2X1neYEGvSteyEY,about the stuff that's in the vape juice,ST,17,White Only,5,1,1,0,about the stuff that is in the vape juice
4,R_2rMIp9IjjI6Mzpi,Ads at vape stores are misleading. It makes yo...,ST,18,Two or More - Non Hispanic,4,1,1,0,Ads at vape stores are misleading It makes yo...


In [29]:
#Reset Index
clean_df.reset_index(drop=True, inplace=True)

In [30]:
#Remove leading and trailing spaces from Text_Clean
clean_df['Text_Clean'] = clean_df['Text_Clean'].str.strip()

#### Make all lowercase

In [31]:
clean_df['Text_Clean'] = clean_df['Text_Clean'].str.lower()
clean_df.head()

Unnamed: 0,ID,Text_Original,Ad,Age,Race,Segment,Urban,Female,Region_1,Text_Clean
0,R_31EnmC82PmXgfeJ,A lot of chemicals are in vapes that damage th...,DF,17,White Only,1,1,0,1,a lot of chemicals are in vapes that damage th...
1,R_2UVKzhgLoqvkyzk,A method of discouragement against vaping by i...,DD,18,White Only,4,0,0,0,a method of discouragement against vaping by i...
2,R_2PC3MSRmAF3ln2c,About the ingredients in vape,ST,18,White Only,1,1,1,0,about the ingredients in vape
3,R_2X1neYEGvSteyEY,about the stuff that's in the vape juice,ST,17,White Only,5,1,1,0,about the stuff that is in the vape juice
4,R_2rMIp9IjjI6Mzpi,Ads at vape stores are misleading. It makes yo...,ST,18,Two or More - Non Hispanic,4,1,1,0,ads at vape stores are misleading it makes yo...


#### Double check for any remaining punctuation

In [32]:
#Print any response that still contains a punctuation character
for i in clean_df['Text_Clean']:
    if any(ch in string.punctuation for ch in i):
        print(i)
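
When checking for leftover punctuation, note that `x in string.punctuation` tests whether `x` is a substring of the punctuation string, so applied to a whole response it only flags text made up entirely of a matching run of symbols. A per-character test is needed to flag punctuation embedded in longer text. The sketch below uses a hypothetical `has_punctuation` helper on invented strings:

```python
import string

PUNCT = set(string.punctuation)

def has_punctuation(text: str) -> bool:
    """Return True if any character in text is ASCII punctuation."""
    return any(ch in PUNCT for ch in text)

samples = ["about the ingredients in vape", "again, to scare teens"]

# Substring test on the whole response: misses the embedded comma.
whole_string_hits = [s for s in samples if s in string.punctuation]
# Per-character test: catches it.
flagged = [s for s in samples if has_punctuation(s)]
```

`string.punctuation` covers ASCII symbols only, so curly quotes such as `’` would need to be added to the set separately.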

#### Check Text for Gibberish Responses
* Code adapted from https://github.com/rrenaud/Gibberish-Detector
* Steps: 
    * Installed using instructions here: https://pypi.org/project/gibberish-detector/
    * Created a folder called 'Examples' with big.txt file from the repository above saved in it within my Home drive (..RR_SD)
    * Ran the following in the terminal:
        * pip install gibberish-detector
        * gibberish-detector train examples/big.txt > big.model
    * Then manually moved the big.model file to this folder to call for detector
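
For intuition about what a trained model of this kind captures, the sketch below scores text by the average log-probability of its character-to-character transitions, learned from a training corpus. This is a simplified illustration of the general Markov-chain idea, not the `gibberish-detector` package's actual implementation; the tiny corpus, smoothing constant, and function names are all assumptions.

```python
import math
from collections import defaultdict

ALPHABET = "abcdefghijklmnopqrstuvwxyz "

def normalize(text):
    """Lowercase and keep only letters and spaces."""
    return [ch for ch in text.lower() if ch in ALPHABET]

def train(corpus, smoothing=1.0):
    """Estimate log P(next char | current char) with additive smoothing."""
    counts = defaultdict(lambda: defaultdict(float))
    chars = normalize(corpus)
    for a, b in zip(chars, chars[1:]):
        counts[a][b] += 1
    log_probs = {}
    for a in ALPHABET:
        total = sum(counts[a].values()) + smoothing * len(ALPHABET)
        log_probs[a] = {
            b: math.log((counts[a][b] + smoothing) / total) for b in ALPHABET
        }
    return log_probs

def avg_transition_logprob(text, log_probs):
    """Average log-probability per transition; lower means less language-like."""
    chars = normalize(text)
    pairs = list(zip(chars, chars[1:]))
    if not pairs:
        return float("-inf")
    return sum(log_probs[a][b] for a, b in pairs) / len(pairs)

# Made-up training text, repeated so that transition counts are not tiny.
corpus = "the main message is that vaping is harmful to the lungs and the body " * 20
model = train(corpus)
real_score = avg_transition_logprob("vaping is harmful", model)
noise_score = avg_transition_logprob("zxqj kvwq", model)
```

A detector built this way would flag text whose score falls below a threshold chosen from known good and bad examples. Short abbreviations can score poorly simply because their letter pairs are rare in the training text.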

In [33]:
#Import package and create detector from model
from gibberish_detector import detector
Detector = detector.create_from_model('big.model')

In [34]:
#Identify gibberish text in df
Gib_Words = []
for i in clean_df['Text_Clean']:
    result = Detector.is_gibberish(i)
    Gib_Words.append(result)
clean_df['Gib_Words'] = Gib_Words
clean_df[clean_df['Gib_Words'] == True]

Unnamed: 0,ID,Text_Original,Ad,Age,Race,Segment,Urban,Female,Region_1,Text_Clean,Gib_Words
247,R_2UgUtnGBATLiPX0,Idk,DF,15,White Only,3,0,1,1,idk,True
248,R_SNaT2CciuQGwgLL,Idk,DF,15,White Only,5,1,0,1,idk,True
249,R_2B8M1hIcmTxepMw,idk,DF,15,White Only,2,0,1,0,idk,True


In [35]:
#Replace responses that are exactly 'idk' with 'do not know'
clean_df['Text_Clean'] = clean_df['Text_Clean'].replace('idk','do not know')

#Run Gibberish detector again
Gib_Words = []
for i in clean_df['Text_Clean']:
    result = Detector.is_gibberish(i)
    Gib_Words.append(result)
clean_df['Gib_Words'] = Gib_Words
clean_df[clean_df['Gib_Words'] == True]

Unnamed: 0,ID,Text_Original,Ad,Age,Race,Segment,Urban,Female,Region_1,Text_Clean,Gib_Words
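
`Series.replace` with plain strings matches whole cell values only, so it rewrites responses that consist solely of `idk`. If the abbreviation could also appear inside a longer response, a word-boundary regular expression is one option. A sketch with the standard `re` module on invented strings, using a hypothetical `expand_idk` helper:

```python
import re

IDK = re.compile(r"\bidk\b")

def expand_idk(text: str) -> str:
    """Replace the standalone token 'idk' wherever it occurs in text."""
    return IDK.sub("do not know", text)

examples = ["idk", "idk maybe vaping is bad", "kidkind"]
expanded = [expand_idk(t) for t in examples]
```

The `\b` anchors keep the pattern from firing inside other words, which is why the last example is left unchanged.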


In [36]:
#Delete Gib_Words
clean_df = clean_df.drop(columns='Gib_Words')

In [37]:
#Final Check
print(clean_df.info())
clean_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1448 entries, 0 to 1447
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   ID             1448 non-null   object
 1   Text_Original  1448 non-null   object
 2   Ad             1448 non-null   object
 3   Age            1448 non-null   int64 
 4   Race           1448 non-null   object
 5   Segment        1448 non-null   int64 
 6   Urban          1448 non-null   uint8 
 7   Female         1448 non-null   uint8 
 8   Region_1       1448 non-null   uint8 
 9   Text_Clean     1448 non-null   object
dtypes: int64(2), object(5), uint8(3)
memory usage: 83.6+ KB
None


Unnamed: 0,ID,Text_Original,Ad,Age,Race,Segment,Urban,Female,Region_1,Text_Clean
0,R_31EnmC82PmXgfeJ,A lot of chemicals are in vapes that damage th...,DF,17,White Only,1,1,0,1,a lot of chemicals are in vapes that damage th...
1,R_2UVKzhgLoqvkyzk,A method of discouragement against vaping by i...,DD,18,White Only,4,0,0,0,a method of discouragement against vaping by i...
2,R_2PC3MSRmAF3ln2c,About the ingredients in vape,ST,18,White Only,1,1,1,0,about the ingredients in vape
3,R_2X1neYEGvSteyEY,about the stuff that's in the vape juice,ST,17,White Only,5,1,1,0,about the stuff that is in the vape juice
4,R_2rMIp9IjjI6Mzpi,Ads at vape stores are misleading. It makes yo...,ST,18,Two or More - Non Hispanic,4,1,1,0,ads at vape stores are misleading it makes yo...


In [38]:
#Save cleaned file to CSV.
clean_df.to_csv('../Datafiles/Feedback_df_clean.csv', index=False) 