# Fake data

This is a notebook for generating data for use in the exercises, including:

1) fictitious names with a high proportion of Steves

2) normally distributed weights in pounds

3) normally distributed heights in feet

4) embassy staff and spies.

This notebook uses DataFrames, so if it doesn't make sense to you, that's perfectly normal. So far you've only learned about Series objects.

Note: this does not take sex into account when generating weights and heights, which you could do if you were less lazy than I am.

In [None]:
# Define URLs to pull from 
CENSUS_SURNAME_URL = 'https://www2.census.gov/topics/genealogy/2010surnames/Names_2010Census_Top1000.xlsx'
HADLEY_FORENAME_URL = 'https://github.com/hadley/data-baby-names/raw/master/baby-names.csv'

In [None]:
# imports
import pandas as pd
import scipy.stats as stats

# Name Generation

In [None]:
# Pandas can read CSVs directly from the internet
forenames = pd.read_csv(HADLEY_FORENAME_URL)
forenames.head()

In [None]:
# And now an Excel file (read_excel needs an Excel engine such as openpyxl installed)
surnames = pd.read_excel(CENSUS_SURNAME_URL, header=1)['SURNAME']

# Keep the first 1,000 rows (dropping the junk at the end) and title-case the names
surnames = surnames.head(1000).str.title()
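
To see what `.str.title()` does to the surnames, which are presumably upper-case in the census file, here is a minimal sketch on a few made-up values (the names below are illustrative and not read from the file):

```python
import pandas as pd

# Hypothetical upper-case surnames standing in for the census column.
raw = pd.Series(['SMITH', 'MCDONALD', 'VAN DYKE'])

# str.title() upper-cases the first letter of each word and lower-cases the rest.
titled = raw.str.title()
print(titled.tolist())
```

Title-casing is purely mechanical, so a name like McDonald comes out as `Mcdonald`. For fake data that is probably good enough.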

In [None]:
# Get forenames from after 2000.
modern_forenames = forenames.loc[forenames['year'] > 2000, 'name']

# Get a sample of 1,000.
forename_sample = modern_forenames.sample(1000)

# Replace 10% of those with 'Steve'
forename_sample.iloc[:int(len(forename_sample) / 10)] = 'Steve'
# Reset the index to 0..999 so it lines up with the surnames
forename_sample.index = pd.RangeIndex(0, 1000)

In [None]:
# Show
forename_sample.head()
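
As a quick sanity check on the Steve replacement, here is a minimal sketch that repeats the same steps on a made-up pool of names, so it runs without a network connection. The pool and the `random_state` seed are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for modern_forenames (no Steves in the pool).
pool = pd.Series(['Ava', 'Liam', 'Noah', 'Emma', 'Mia'] * 400)

# random_state is an arbitrary seed, used here only for repeatability.
sample = pool.sample(1000, random_state=0)

# Overwrite the first 10% of the sample, then reset the index.
sample.iloc[:len(sample) // 10] = 'Steve'
sample.index = pd.RangeIndex(0, 1000)

# Fraction of the sample that is now 'Steve'.
steve_share = (sample == 'Steve').mean()
print(steve_share)

# Every one of the first 100 rows is a Steve.
print((sample.iloc[:100] == 'Steve').all())
```

Because the replacement targets the first 100 positions, all the injected Steves sit at the top of the Series and get paired with the first 100 surnames. If that matters, shuffling with `sample(frac=1)` before resetting the index would scatter them.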

In [None]:
# Show
surnames.head()

In [None]:
# Make combos
full_names = forename_sample + ' ' + surnames

In [None]:
# Show example
full_names.sample(5)

In [None]:
# Write to file.
full_names.name = 'names'
full_names.to_csv('data/subject_names.csv', header=True, index=False)

# Weight Generation

Weights are drawn from a normal distribution with a mean of 172 pounds and a standard deviation of 29 pounds.

In [None]:
# Create random data
weight = pd.Series(
    data=stats.norm.rvs(loc=172, scale=29, size=1000)
)

# Round it
rounded_w = weight.round(2)

# Write it to disk.
rounded_w.name = 'pounds'
rounded_w.to_csv('data/weight_in_pounds.csv', header=True, index=False)

# Display
rounded_w.head(5)
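
The cell above produces different weights on every run. If repeatable output is wanted, one option is to pass a seed through the `random_state` argument of `rvs`. This is a sketch under that assumption, and the seed value 42 is arbitrary.

```python
import pandas as pd
import scipy.stats as stats

# random_state=42 is an arbitrary seed chosen for this sketch.
weight_a = pd.Series(stats.norm.rvs(loc=172, scale=29, size=1000, random_state=42))
weight_b = pd.Series(stats.norm.rvs(loc=172, scale=29, size=1000, random_state=42))

# Same seed, same draws.
print(weight_a.equals(weight_b))

# The sample statistics should land near the requested mean and standard deviation.
print(round(weight_a.mean(), 1), round(weight_a.std(), 1))
```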

# Height Generation

Heights are drawn from a normal distribution with a mean of 5.66 feet (about 68 inches) and a standard deviation of 0.33 feet (about 4 inches).

In [None]:
# Create random data
height = pd.Series(
    data=stats.norm.rvs(loc=5.66, scale=.33, size=1000)
)

# Round it
rounded_h = height.round(2)

# Write it to disk.
rounded_h.name = 'feet'
rounded_h.to_csv('data/height_in_feet.csv', header=True, index=False)

# Display
rounded_h.head()
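
Decimal feet are easy to misread: 5.66 feet is about 5'8", not 5'6". The sketch below checks the feet-to-inches arithmetic for the parameters used above and converts a few made-up decimal heights to feet-and-inches labels. The sample heights are hypothetical.

```python
import pandas as pd

mean_ft, std_ft = 5.66, 0.33

# There are 12 inches in a foot.
mean_in = round(mean_ft * 12, 2)
std_in = round(std_ft * 12, 2)
print(mean_in, std_in)

# Hypothetical heights in decimal feet.
heights_ft = pd.Series([5.0, 5.5, 6.25])

# Convert to whole inches, then split into feet and leftover inches.
total_inches = (heights_ft * 12).round().astype(int)
feet = total_inches // 12
inches = total_inches % 12

labels = feet.astype(str) + "'" + inches.astype(str) + '"'
print(labels.tolist())
```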

# Spy and Staff Generation

Create two Series, one for spies and one for embassy staff, each indexed by first name. Several names appear in both.

In [None]:
spies = pd.Series({
    'Mata'    : 40,
    'Casanova': 41,
    'Julius'  : 51,
    'Ethel'   : 48,
    'Klaus'   : 35,
    'Belle'   : 30,
    'Valery'  : 28,
})

embassy_staff = pd.Series({
    'Dave'    : 30,
    'Julius'  : 51,
    'Ethel'   : 48,
    'Jenna'   : 25,
    'Klaus'   : 35,
    'Aloysius': 84,
    'Carlos'  : 40,
    'Michael' : 28,
    'Tito'    : 32,
    'Jermaine': 30,
    'Janet'   : 28,
    'Marlon'  : 25,
    'Jackie'  : 22,
})

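# Write without a header row. Recent pandas versions write one by default,
# so ask for header=False explicitly.
spies.to_csv('data/spies.csv', header=False)
embassy_staff.to_csv('data/embassy_staff.csv', header=False)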
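
Reading a headerless file back into a Series takes a couple of extra arguments. Here is a minimal sketch, assuming the file is written with `header=False`. It round-trips through an in-memory string rather than `data/spies.csv`, so it runs on its own.

```python
import io
import pandas as pd

spies = pd.Series({
    'Mata': 40, 'Casanova': 41, 'Julius': 51, 'Ethel': 48,
    'Klaus': 35, 'Belle': 30, 'Valery': 28,
})

# With no path given, to_csv returns the CSV text instead of writing a file.
csv_text = spies.to_csv(header=False)

# header=None: the first line is data, not column names.
# index_col=0: use the names as the index.
# squeeze('columns'): turn the resulting one-column DataFrame into a Series.
loaded = pd.read_csv(
    io.StringIO(csv_text), header=None, index_col=0
).squeeze('columns')

print(loaded['Julius'])
```

The reloaded Series carries the column position as its name, so set `loaded.name` and `loaded.index.name` afterwards if you want something more readable.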