# Survey Analysis

## Import Packages

In [2]:
import pandas as pd
import numpy as np

## Load in Data

In [3]:
survey_df = pd.read_csv('comma-survey.csv', index_col='RespondentID')

In [8]:
survey_df.head()

| RespondentID | In your opinion, which sentence is more gramatically correct? | Prior to reading about it above, had you heard of the serial (or Oxford) comma? | How much, if at all, do you care about the use (or lack thereof) of the serial (or Oxford) comma in grammar? | How would you write the following sentence? | When faced with using the word "data", have you ever spent time considering if the word was a singular or plural noun? | How much, if at all, do you care about the debate over the use of the word "data" as a singluar or plural noun? | In your opinion, how important or unimportant is proper use of grammar? | Gender | Age | Household Income | Education | Location (Census Region) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3292953864 | It's important for a person to be honest, kind... | Yes | Some | Some experts say it's important to drink milk,... | No | Not much | Somewhat important | Male | 30-44 | \$50,000 - \$99,999 | Bachelor degree | South Atlantic |
| 3292950324 | It's important for a person to be honest, kind... | No | Not much | Some experts say it's important to drink milk,... | No | Not much | Somewhat unimportant | Male | 30-44 | \$50,000 - \$99,999 | Graduate degree | Mountain |
| 3292942669 | It's important for a person to be honest, kind... | Yes | Some | Some experts say it's important to drink milk,... | Yes | Not at all | Very important | Male | 30-44 | NaN | NaN | East North Central |
| 3292932796 | It's important for a person to be honest, kind... | Yes | Some | Some experts say it's important to drink milk,... | No | Some | Somewhat important | Male | 18-29 | NaN | Less than high school degree | Middle Atlantic |
| 3292932522 | It's important for a person to be honest, kind... | No | Not much | Some experts say it's important to drink milk,... | No | Not much | NaN | NaN | NaN | NaN | NaN | NaN |


## Compute Descriptive Statistics

In [12]:
n_obs = len(survey_df)
n_features = len(survey_df.columns)
print('The data set has {} observations and {} features'.format(n_obs, n_features))

The data set has 1129 observations and 12 features
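
The per-column checks in the following sections repeat the same two steps: count the answers, then count the missing values. As a minimal sketch, these steps could be wrapped in a small helper. The function name `summarize_column` and the toy data are hypothetical and only serve as an illustration; they are not part of the original notebook.

```python
import pandas as pd


def summarize_column(df, position):
    """Return the value counts and the number of missing entries of one column."""
    column = df.iloc[:, position]
    # value_counts() ignores missing values by default.
    return column.value_counts(), int(column.isna().sum())


# Hypothetical toy data standing in for the survey; not the real responses.
toy_df = pd.DataFrame({
    'heard_of_comma': ['Yes', 'No', None, 'Yes'],
    'care': ['Some', None, None, 'A lot'],
})

for position, name in enumerate(toy_df.columns):
    counts, n_missing = summarize_column(toy_df, position)
    print('{}: {} missing values'.format(name, n_missing))
    print(counts.to_string())
```

Applied to the survey, the same loop would run over `survey_df.columns` instead of the toy columns.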


### Column 1 (grammatically correct sentence)

In [15]:
survey_df.iloc[:,0].value_counts()

It's important for a person to be honest, kind, and loyal.    641
It's important for a person to be honest, kind and loyal.     488
Name: In your opinion, which sentence is more gramatically correct?, dtype: int64

In [19]:
print('The first column has {} missing values'.format(survey_df.iloc[:,0].isna().sum()))

The first column has 0 missing values


### Column 2 (Oxford comma)

In [16]:
survey_df.iloc[:,1].value_counts()

Yes    655
No     444
Name: Prior to reading about it above, had you heard of the serial (or Oxford) comma?, dtype: int64

In [22]:
print('The second column has {} missing values'.format(survey_df.iloc[:,1].isna().sum()))

The second column has 30 missing values


### Combined Columns 1 and 2

In [23]:
survey_df[survey_df.iloc[:,1]=='Yes'].iloc[:,0].value_counts()

It's important for a person to be honest, kind, and loyal.    423
It's important for a person to be honest, kind and loyal.     232
Name: In your opinion, which sentence is more gramatically correct?, dtype: int64

In [24]:
survey_df[survey_df.iloc[:,1]=='No'].iloc[:,0].value_counts()

It's important for a person to be honest, kind and loyal.     244
It's important for a person to be honest, kind, and loyal.    200
Name: In your opinion, which sentence is more gramatically correct?, dtype: int64

In [27]:
survey_df[survey_df.iloc[:,1].isna()].iloc[:,0].value_counts()

It's important for a person to be honest, kind, and loyal.    18
It's important for a person to be honest, kind and loyal.     12
Name: In your opinion, which sentence is more gramatically correct?, dtype: int64

Takeaway: among respondents who had heard of the Oxford comma, a clear majority (423 of 655, about 65%) prefer the sentence with the serial comma, whereas among those who had not, a small majority (244 of 444, about 55%) prefer the sentence without it. Furthermore, every respondent answered the first question, whereas 30 respondents gave no answer to the second question.
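
The three filtered `value_counts()` calls can also be expressed as a single cross-tabulation. The sketch below rebuilds one row per respondent from the counts reported above (only respondents who answered both questions) and uses `pd.crosstab` with `normalize='index'` to obtain within-group shares. The column names `heard` and `choice` and their labels are placeholders chosen for this example.

```python
import pandas as pd

# One row per respondent, rebuilt from the counts reported above
# (655 "Yes" and 444 "No" answers to the second question).
rows = pd.DataFrame({
    'heard': ['Yes'] * 655 + ['No'] * 444,
    'choice': (['with serial comma'] * 423 + ['without serial comma'] * 232
               + ['with serial comma'] * 200 + ['without serial comma'] * 244),
})

# normalize='index' divides every row of the table by its row total.
table = pd.crosstab(rows['heard'], rows['choice'], normalize='index')
print(table.round(3))
```

On the raw data, the equivalent call would presumably be `pd.crosstab(survey_df.iloc[:, 1], survey_df.iloc[:, 0], normalize='index')`, which leaves out respondents with a missing answer.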

### Column 3

In [29]:
survey_df.iloc[:,2].value_counts()

Some          414
A lot         291
Not much      268
Not at all    126
Name: How much, if at all, do you care about the use (or lack thereof) of the serial (or Oxford) comma in grammar?, dtype: int64

In [36]:
print('The third column has {} missing values'.format(survey_df.iloc[:,2].isna().sum()))
print(survey_df[survey_df.iloc[:,2].isna()].iloc[:,1].value_counts())

The third column has 30 missing values
Series([], Name: Prior to reading about it above, had you heard of the serial (or Oxford) comma?, dtype: int64)


Takeaway: the third column has as many missing values as the second column. The empty result of the second print statement shows that no respondent who skipped the third question answered the second one. Since both columns have 30 missing values, the same 30 respondents skipped both questions.
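
An empty `value_counts()` result checks only one direction (rows missing column 3 have no answer in column 2). A more direct check compares the two missingness masks row by row. The following is a minimal sketch on hypothetical toy data; the column names `q2`, `q3`, and `q4` are placeholders.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; not the real survey responses.
toy_df = pd.DataFrame({
    'q2': ['Yes', np.nan, 'No', np.nan],
    'q3': ['Some', np.nan, 'A lot', np.nan],
    'q4': ['is', 'are', np.nan, np.nan],
})

missing_q2 = toy_df['q2'].isna()
missing_q3 = toy_df['q3'].isna()
missing_q4 = toy_df['q4'].isna()

# True only if both columns are missing in exactly the same rows.
same_rows_q2_q3 = bool((missing_q2 == missing_q3).all())
same_rows_q2_q4 = bool((missing_q2 == missing_q4).all())
print(same_rows_q2_q3, same_rows_q2_q4)
```

For the survey, the same comparison would use `survey_df.iloc[:, 1].isna()` and `survey_df.iloc[:, 2].isna()`.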

### Combined Columns 2 and 3

In [33]:
survey_df[survey_df.iloc[:,1]=='Yes'].iloc[:,2].value_counts()

Some          243
A lot         229
Not much      141
Not at all     42
Name: How much, if at all, do you care about the use (or lack thereof) of the serial (or Oxford) comma in grammar?, dtype: int64

In [34]:
survey_df[survey_df.iloc[:,1]=='No'].iloc[:,2].value_counts()

Some          171
Not much      127
Not at all     84
A lot          62
Name: How much, if at all, do you care about the use (or lack thereof) of the serial (or Oxford) comma in grammar?, dtype: int64

Among respondents who had not heard of the Oxford comma, a larger fraction cares "Not much" or "Not at all" about its use (211 of 444, about 48%) than among respondents who had heard of it (183 of 655, about 28%).
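
Because the two groups differ in size, the comparison is easier to read as within-group shares than as raw counts. The sketch below takes the counts from the two outputs above and divides each row by its total; the variable names are chosen for this example only.

```python
import pandas as pd

# Counts copied from the two value_counts() outputs above.
care_counts = pd.DataFrame(
    {'A lot': [229, 62], 'Some': [243, 171],
     'Not much': [141, 127], 'Not at all': [42, 84]},
    index=['Yes', 'No'],
)

# Divide each row by its row total to get the share of each answer per group.
care_shares = care_counts.div(care_counts.sum(axis=1), axis=0)

# Combined share of respondents who care "Not much" or "Not at all".
low_care = care_shares[['Not much', 'Not at all']].sum(axis=1)
print(low_care.round(3))
```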

### Combined Columns 1 and 3

To be continued...

### Column 4

In [35]:
survey_df.iloc[:,3].value_counts()

Some experts say it's important to drink milk, but the data is inconclusive.     865
Some experts say it's important to drink milk, but the data are inconclusive.    228
Name: How would you write the following sentence?, dtype: int64

In [40]:
print('The fourth column has {} missing values'.format(survey_df.iloc[:,3].isna().sum()))

The fourth column has 36 missing values


Most respondents (865 of the 1093 who answered) treat "data" as singular. This column has 36 missing responses, 6 more than each of the previous two columns.

### Column 5

In [38]:
survey_df.iloc[:,4].value_counts()

No     547
Yes    544
Name: When faced with using the word "data", have you ever spent time considering if the word was a singular or plural noun?, dtype: int64

In [39]:
print('The fifth column has {} missing values'.format(survey_df.iloc[:,4].isna().sum()))

The fifth column has 38 missing values


About half of the respondents have spent time considering whether "data" is singular or plural. This column has two more missing values than the previous column.

### Column 6

In [41]:
survey_df.iloc[:,5].value_counts()

Not much      403
Some          352
Not at all    203
A lot         133
Name: How much, if at all, do you care about the debate over the use of the word "data" as a singluar or plural noun?, dtype: int64

In [42]:
print('The sixth column has {} missing values'.format(survey_df.iloc[:,5].isna().sum()))

The sixth column has 38 missing values


A majority of respondents (606 of 1091) care "Not much" or "Not at all" about the debate over the word "data". The number of missing values is the same as for the previous column.

### Columns 4-5-6 compared

To be continued

### Column 7

In [43]:
survey_df.iloc[:,6].value_counts()

Very important                                 688
Somewhat important                             333
Neither important nor unimportant (neutral)     26
Somewhat unimportant                             7
Very unimportant                                 5
Name: In your opinion, how important or unimportant is proper use of grammar?, dtype: int64

In [44]:
print('The seventh column has {} missing values'.format(survey_df.iloc[:,6].isna().sum()))

The seventh column has 70 missing values


Most respondents state that they find proper use of grammar important. This question has almost twice as many missing values as the previous question (70 versus 38).

### Column 8

In [45]:
survey_df.iloc[:,7].value_counts()

Female    548
Male      489
Name: Gender, dtype: int64

In [46]:
print('The eighth column has {} missing values'.format(survey_df.iloc[:,7].isna().sum()))

The eighth column has 92 missing values


The survey has slightly more female than male respondents (548 versus 489). The number of missing values for this question rises to 92.

### Column 9

In [47]:
survey_df.iloc[:,8].value_counts()

45-60    290
> 60     272
30-44    254
18-29    221
Name: Age, dtype: int64

In [48]:
print('The ninth column has {} missing values'.format(survey_df.iloc[:,8].isna().sum()))

The ninth column has 92 missing values


In [49]:
survey_df[survey_df.iloc[:,7].isna()].iloc[:,8].value_counts()

Series([], Name: Age, dtype: int64)

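Respondents who skipped the gender question also skipped the age question: the empty result above shows that no row with a missing gender has an age, and both columns have 92 missing values. The number of respondents rises with age group up to 45-60 (290) and dips slightly for the > 60 group (272).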
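
`value_counts()` sorts by frequency, which hides the natural order of the age groups. As a small sketch, the counts from the output above can be reindexed in age order; the list `age_order` is written by hand for this example.

```python
import pandas as pd

# Counts copied from the value_counts() output above (sorted by frequency).
age_counts = pd.Series({'45-60': 290, '> 60': 272, '30-44': 254, '18-29': 221})

# Hand-written natural order of the age groups.
age_order = ['18-29', '30-44', '45-60', '> 60']
age_counts_ordered = age_counts.reindex(age_order)
print(age_counts_ordered)
```

On the raw data, the same idea would read `survey_df.iloc[:, 8].value_counts().reindex(age_order)`.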

### Column 10

In [50]:
survey_df.iloc[:,9].value_counts()

$50,000 - $99,999      290
$100,000 - $149,999    164
$25,000 - $49,999      158
$0 - $24,999           121
$150,000+              103
Name: Household Income, dtype: int64

In [51]:
print('The tenth column has {} missing values'.format(survey_df.iloc[:,9].isna().sum()))

The tenth column has 293 missing values


The biggest income group is \$50,000 to \$99,999, and the two smallest groups are the two extremes (\$0-\$24,999 and \$150,000+). Notably, the number of missing values (293) is more than triple that of the previous question (92), indicating that a substantial portion of respondents did not (want to) answer a question about their household income.

### Column 11

In [52]:
survey_df.iloc[:,10].value_counts()

Bachelor degree                     344
Some college or Associate degree    295
Graduate degree                     276
High school degree                  100
Less than high school degree         11
Name: Education, dtype: int64

In [53]:
print('The eleventh column has {} missing values'.format(survey_df.iloc[:,10].isna().sum()))

The eleventh column has 103 missing values


The biggest education group consists of respondents with a Bachelor degree, whereas only a small portion of respondents (111 of 1026) have a high school degree or less. Interestingly, although the previous (income) question had 293 missing values, this question has only 103.

### Column 12

In [54]:
survey_df.iloc[:,11].value_counts()

Pacific               180
East North Central    170
South Atlantic        164
Middle Atlantic       140
West South Central     88
Mountain               87
West North Central     82
New England            73
East South Central     43
Name: Location (Census Region), dtype: int64

In [55]:
print('The twelfth column has {} missing values'.format(survey_df.iloc[:,11].isna().sum()))

The twelfth column has 102 missing values


This column has one fewer missing value than the previous column (102 versus 103). The largest location groups are 'Pacific', 'East North Central', 'South Atlantic', and 'Middle Atlantic'.

### Missing Data Overall

In [56]:
survey_df.isna().sum()

In your opinion, which sentence is more gramatically correct?                                                               0
Prior to reading about it above, had you heard of the serial (or Oxford) comma?                                            30
How much, if at all, do you care about the use (or lack thereof) of the serial (or Oxford) comma in grammar?               30
How would you write the following sentence?                                                                                36
When faced with using the word "data", have you ever spent time considering if the word was a singular or plural noun?     38
How much, if at all, do you care about the debate over the use of the word "data" as a singluar or plural noun?            38
In your opinion, how important or unimportant is proper use of grammar?                                                    70
Gender                                                                                                                     92
Age                                                                                                                        92
Household Income                                                                                                          293
Education                                                                                                                 103
Location (Census Region)                                                                                                  102
dtype: int64

In [66]:
print('The total number of rows with missing data is {}'.format(survey_df.isna().any(axis=1).sum()))
print('This constitutes {}% of observations'.format(np.round(100*survey_df.isna().any(axis=1).sum()/n_obs, 2)))

The total number of rows with missing data is 304
This constitutes 26.93% of observations


Missingness is not nested across columns: the question with the most missing values (Household Income, 293) accounts for fewer rows than the 304 rows that have at least one missing value, so some respondents answered the income question but skipped another one. The rows with some degree of missingness constitute around 27\% of our observations.
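
The gap between the largest per-column count and the number of incomplete rows can be illustrated on a small example. The sketch below uses hypothetical toy data (the column names and values are made up) in which the column with the most missing values does not cover every incomplete row.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; not the real survey responses.
toy_df = pd.DataFrame({
    'income': [np.nan, 50, 60, np.nan, 70],
    'education': ['BA', np.nan, 'HS', 'BA', 'MA'],
    'region': ['Pacific', 'Mountain', 'Pacific', 'Pacific', 'Mountain'],
})

# Missing values per column, and rows with at least one missing value.
missing_per_column = toy_df.isna().sum()
rows_with_any_missing = int(toy_df.isna().any(axis=1).sum())

# Number of missing answers per row, useful for spotting partial responses.
missing_per_row = toy_df.isna().sum(axis=1)

print(missing_per_column.max(), rows_with_any_missing)
```

Here the row that lacks only `education` is counted among the incomplete rows but not in the `income` column, which is the same effect as in the survey data.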