# Analyze the Survey Data

Analyze the survey data in `survey.csv` using the power of PYTHON!!

Also STATISTICS!!

And maybe some other made-up stuff

## Installing

In a clean `python3` environment, run:

In [None]:
!pip install -r requirements.txt

Imports, etc.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

## Loading the Raw Data

Load the data

In [None]:
df = pd.read_csv("survey.csv")

In [None]:
df.head()

## Data Cleaning and Normalization

Alias categories for easier plotting

In [None]:
df['gender'] = df['Gender identity?']
df['identity'] = df['What is your sexual identity?']
df['age'] = df['What is your age?']
df['willing_to_date'] = df['What is the age range you are willing to date a member of your preferred sex (MPS)? (Check all that apply)']
df['prefer_to_date'] = df['What is the age range you would PREFER/WANT to date?']
df['reason'] = df['For each of the age ranges you checked off that you would PREFER to date, please provide a short reason why you would date that age range']

df = df[['gender', 'identity', 'age', 'willing_to_date', 'prefer_to_date', 'reason']]

In [None]:
df.head()

Encode the category ranges to numeric values for ease of analysis

In [None]:
# Map the long category names to numeric values
preferred_date_map = {
    '10+ years older': (10, 99),
    '10+ years younger': (-99, -10),
    '3 years older to same age': (0, 3),
    '3 years older to same old': (0, 3),
    '3-5 years younger': (-5, -3),
    '3-6 years older': (3, 6),
    '5-10 years younger': (-10, -5),
    '7-10 years older': (7, 10),
    'Same age to 2 years younger': (-2, 0),
}

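def encode_ranges(cat) -> pd.Series:
    """Encode the range checkboxes into a single bounding (min, max) range."""
    missing = pd.Series({'min': np.nan, 'max': np.nan})
    # Unanswered questions are read in as NaN
    if pd.isna(cat):
        return missing
    # Answers are ';'-separated category names; unknown names are skipped
    ranges = np.array([preferred_date_map[c] for c in cat.split(';') if c in preferred_date_map])
    if ranges.size == 0:
        return missing
    return pd.Series({'min': np.min(ranges), 'max': np.max(ranges)})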
    

# Encode the checkboxes into ranges
willing_range = df['willing_to_date'].apply(encode_ranges)
df['min_willing'] = willing_range['min']
df['max_willing'] = willing_range['max']

prefer_range = df['prefer_to_date'].apply(encode_ranges)
df['min_prefer'] = prefer_range['min']
df['max_prefer'] = prefer_range['max']

# Encode the ages into ranges
age_map = {
    '18-22': (18, 22),
    '23-25': (23, 25),
    '26-29': (26, 29),
    '30+': (30, 99),
}
age_map = {k: pd.Series({'min': v0, 'max': v1})
           for k, (v0, v1) in age_map.items()}
age_range = df['age'].apply(lambda x: age_map[x])
df['min_age'] = age_range['min']
df['max_age'] = age_range['max']

# Drop the unused columns again
df = df[['gender', 'identity', 'age', 'min_age', 'max_age', 'min_willing', 'max_willing', 'min_prefer', 'max_prefer', 'reason']]
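
To make the encoding concrete, here is a minimal, self-contained sketch of how one semicolon-separated checkbox answer collapses into a bounding range. The `toy_map` subset and the `answer` string are illustrative assumptions, not rows taken from `survey.csv`.

```python
import numpy as np

# Hypothetical subset of the category map (illustrative only)
toy_map = {
    '3 years older to same age': (0, 3),
    '3-5 years younger': (-5, -3),
    'Same age to 2 years younger': (-2, 0),
}

# Hypothetical checkbox answer: several categories joined by ';'
answer = '3 years older to same age;3-5 years younger'

# Stack the (low, high) pairs, then take the overall min and max
ranges = np.array([toy_map[c] for c in answer.split(';') if c in toy_map])
bounds = (int(ranges.min()), int(ranges.max()))
print(bounds)  # (-5, 3)
```

Note that the bounding range also covers any gap between non-adjacent selections (here, 0 to 3 years younger was not checked), so it can be wider than what the respondent actually picked.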


In [None]:
# Some rows have missing (NA) values, so just drop those rows
print('Before dropping invalid values: {}'.format(df.shape))
df = df.dropna(axis=0, how='any')
print('After dropping invalid values: {}'.format(df.shape))
df.head()
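
As a small illustration of what `dropna(axis=0, how='any')` does, the sketch below uses a made-up three-row frame (the `toy` name and its values are assumptions, not survey data). Any row with at least one missing value is removed.

```python
import numpy as np
import pandas as pd

# Made-up frame: only the first row is fully populated
toy = pd.DataFrame({
    'min_prefer': [0.0, np.nan, -5.0],
    'max_prefer': [3.0, 6.0, np.nan],
})

# axis=0 drops rows; how='any' drops a row if any of its values is NA
cleaned = toy.dropna(axis=0, how='any')
print('Before: {}, after: {}'.format(toy.shape, cleaned.shape))
```

Because a single missing field discards the whole response, comparing the two shapes is a quick check on how much of the sample survives.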

## Demographic Distributions

The survey actually looks relatively balanced for gender identity considering how biased the sampling was...

(other than `Other/Non binary`, of course, which would require more sophisticated methodology)

In [None]:
sns.countplot(data=df, x='gender');
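
One way to put a number on "relatively balanced" is to look at the share of each category rather than only the bar heights. A minimal sketch, using a made-up `toy_gender` series in place of `df['gender']` (the `Female` and `Male` labels are assumed for illustration):

```python
import pandas as pd

# Made-up responses standing in for df['gender']
toy_gender = pd.Series(['Female', 'Male', 'Female', 'Male', 'Other/Non binary'])

# normalize=True returns fractions of the total instead of raw counts
shares = toy_gender.value_counts(normalize=True)
print(shares)
```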

It's fairly unbalanced for sexual identity, but that was always going to be hard with a random friend sample...

In [None]:
sns.countplot(data=df, x='identity');

And we have a nice bell curve for age...

Exercise left to the reader: *If you are 27 years old, do the ages of your friends follow a normal distribution?*

In [None]:
sns.countplot(data=df, x='age', order=['18-22', '23-25', '26-29', '30+']);
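
The explicit `order=` argument is there because the age brackets are plain strings with no inherent order. An alternative sketch, assuming you would rather attach the ordering to the data itself, is an ordered categorical; `toy_age` below is made up and stands in for `df['age']`.

```python
import pandas as pd

age_order = ['18-22', '23-25', '26-29', '30+']

# Made-up responses standing in for df['age']
toy_age = pd.Series(['26-29', '18-22', '30+', '26-29'])

# Attach the bracket order to the column via an ordered categorical dtype
toy_age = toy_age.astype(pd.CategoricalDtype(categories=age_order, ordered=True))

# sort=False keeps the category order instead of sorting by frequency
counts = toy_age.value_counts(sort=False)
print(counts)
```

With this approach, brackets that have no responses still show up with a count of zero, which keeps the axis consistent across subsets of the data.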