# Fake data

This is a notebook for generating data for use in the exercises, including:

1) fictitious names with a high proportion of Steves

2) normally distributed weights in pounds

3) normally distributed heights in feet

This notebook uses DataFrames, so if parts of it don't make sense to you, that's perfectly normal: so far you've only learned about Series objects.

Note: this does not take sex into account when generating weights and heights, which you could do if you were less lazy than I am.

In [1]:
# Define URLs to pull from 
CENSUS_SURNAME_URL = 'https://www2.census.gov/topics/genealogy/2010surnames/Names_2010Census_Top1000.xlsx'
HADLEY_FORENAME_URL = 'https://github.com/hadley/data-baby-names/raw/master/baby-names.csv'

In [2]:
# imports
import pandas as pd
import scipy.stats as stats

# Name generation

In [3]:
# Pandas can read CSVs directly from the internet
forenames = pd.read_csv(HADLEY_FORENAME_URL)
forenames.head()

   year     name   percent  sex
0  1880     John  0.081541  boy
1  1880  William  0.080511  boy
2  1880    James  0.050057  boy
3  1880  Charles  0.045167  boy
4  1880   George  0.043292  boy


In [4]:
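# And now an Excel file; header=1 takes the column names from the second row
surnames = pd.read_excel(CENSUS_SURNAME_URL, header=1)['SURNAME']

# Keep the first 1,000 rows (drops the junk at the end) and title-case the names
surnames = surnames.head(1000).str.title()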

In [5]:
# Get forenames from after 2000.
modern_forenames = forenames.loc[forenames['year'] > 2000, 'name']

# Get a sample of 1,000.
forename_sample = modern_forenames.sample(1000)

# Replace the first 10% of those with 'Steve'
forename_sample.iloc[:int(len(forename_sample) / 10)] = 'Steve'

# Reset the index to 0-999 so it lines up with the surnames index
forename_sample.index = pd.RangeIndex(0, 1000)

In [6]:
# Show
forename_sample.head()

0    Steve
1    Steve
2    Steve
3    Steve
4    Steve
Name: name, dtype: object
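
The first rows are all Steves because the replacement in cell 5 slices by position, so the Steves sit at the top of the sample instead of being scattered through it. Here is a minimal sketch of the same filter, sample, and replace steps on a small made-up table. The `toy` DataFrame, its placeholder names, and the `random_state` seed are assumptions for illustration and are not part of the notebook's data.

```python
import pandas as pd

# Hypothetical stand-in for the forenames table: 40 made-up rows.
toy = pd.DataFrame({
    'year': [1990] * 10 + [2005] * 30,
    'name': [f'Name{i}' for i in range(40)],
})

# The boolean mask picks the rows, 'name' picks the column; the result is a Series.
modern = toy.loc[toy['year'] > 2000, 'name']

# random_state is only here to make the sketch repeatable.
sample = modern.sample(20, random_state=0)

# Same 10% rule as cell 5: a positional slice over the first rows of the sample.
n_steves = int(len(sample) / 10)
sample.iloc[:n_steves] = 'Steve'
sample.index = pd.RangeIndex(0, len(sample))

# Fraction of the sample that is now 'Steve'.
steve_share = (sample == 'Steve').mean()
print(len(modern), n_steves, steve_share)
```

Comparing the Series to `'Steve'` gives a boolean Series, and its mean is the share of `True` values, which is a quick way to confirm the proportion.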

In [7]:
# Show
surnames.head()

0       Smith
1     Johnson
2    Williams
3       Brown
4       Jones
Name: SURNAME, dtype: object

In [8]:
# Make combos
full_names = forename_sample + ' ' + surnames

In [9]:
# Show example
full_names.sample(5)

592     Meadow Carson
877      Amirah Avery
445    Jaylene Walton
746     Seth Stafford
872    Nikolas Bernal
dtype: object
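
The addition in cell 8 works because pandas lines up two Series by index label, not by position. That is why the forename sample was given a fresh 0 to 999 index in cell 5. A minimal sketch of what happens with and without matching labels, using two tiny made-up Series (`first` and `last` are illustrative names, not the notebook's variables):

```python
import pandas as pd

# Hypothetical toy data standing in for forename_sample and surnames.
first = pd.Series(['Steve', 'Ana', 'Luis'], index=[7, 3, 9])  # labels left over from sampling
last = pd.Series(['Smith', 'Johnson', 'Williams'])            # default labels 0, 1, 2

# No labels in common, so every combined value comes out missing.
misaligned = first + ' ' + last

# After resetting the index, labels match row for row.
first_reset = first.reset_index(drop=True)
aligned = first_reset + ' ' + last

print(misaligned.isna().all())
print(aligned.tolist())
```

Without the index reset, the result covers the union of both sets of labels and contains only missing values, which is easy to miss until the file is written out.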

In [10]:
# Write to file.
full_names.to_csv('data/subject_names.csv')

# Weight generation

Using normally distributed weights with a mean of 172 pounds and a standard deviation of 29 pounds.

In [11]:
# Create random data
weight = pd.Series(
    data=stats.norm.rvs(loc=172, scale=29, size=1000)
)

# Round it
rounded_w = weight.round(2)

# Write it to disk.
rounded_w.to_csv('data/weight_in_pounds.csv')

# Display
rounded_w.head(5)

0    143.16
1     92.43
2    171.60
3    172.46
4    207.19
dtype: float64
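
The values above will differ on every run because no seed is set. If you want repeatable draws, `rvs` accepts a `random_state` argument. The sketch below assumes an arbitrary seed of 0, which the notebook itself does not use, and checks that the sample's mean and standard deviation land near the requested 172 and 29.

```python
import pandas as pd
import scipy.stats as stats

# random_state=0 is an arbitrary seed chosen for this sketch.
weight = pd.Series(stats.norm.rvs(loc=172, scale=29, size=1000, random_state=0))
again = pd.Series(stats.norm.rvs(loc=172, scale=29, size=1000, random_state=0))

# The same seed reproduces the same draws.
print(weight.equals(again))

# Sample statistics should be close to, but not exactly, the parameters.
print(round(weight.mean(), 1), round(weight.std(), 1))
```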

# Height generation

Using normally distributed heights with a mean of 5.66 feet (about 68 inches) and a standard deviation of 0.33 feet (about 4 inches).

In [12]:
# Create random data
height = pd.Series(
    data=stats.norm.rvs(loc=5.66, scale=.33, size=1000)
)

# Round it
rounded_h = height.round(2)

# Write it to disk.
rounded_h.to_csv('data/height_in_feet.csv')

# Display
rounded_h.head()

0    5.57
1    5.70
2    6.12
3    5.90
4    5.87
dtype: float64
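
The inch equivalents quoted at the top of this section are rounded. A quick check of the arithmetic, using nothing beyond the two parameters passed to `rvs` (the variable names are just for this sketch):

```python
FEET_TO_INCHES = 12

mean_ft, stdev_ft = 5.66, 0.33

# Convert both parameters from feet to inches.
mean_in = mean_ft * FEET_TO_INCHES
stdev_in = stdev_ft * FEET_TO_INCHES

print(round(mean_in, 2), round(stdev_in, 2))  # 67.92 3.96
```

So the mean is just under 68 inches and the standard deviation just under 4 inches.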