# Workbook: Data Wrangling

To get a feel for data wrangling and to help you start thinking about your second assignment, we'll work through an example in this workbook to:
* read a DataFrame into Python
* wrangle the data
* clean the data

# Part I: Setup

Data wrangling often requires additional functionality outside what's included in Python by default. For this, we'll import other functions from helpful packages.

**Import the following packages using their common shortened name found in parentheses:**

* `numpy` (`np`)
* `pandas` (`pd`)

In [None]:
## YOUR CODE HERE

**Run the following code cell to make things throughout the rest of this workbook a little prettier.** (Note: You don't have to edit the code here, but you are free to change it and see what happens, to be sure you understand each line.)

In [None]:
# Configure libraries

# Don't display too many rows/cols of DataFrames
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8

# Round decimals when displaying DataFrames
pd.set_option('display.precision', 2)

**Read the CSV file at the URL 'https://raw.githubusercontent.com/fivethirtyeight/data/master/steak-survey/steak-risk-survey.csv' into Python and assign it to the variable `survey`**.

In [None]:
## YOUR CODE HERE

In [None]:
assert survey.shape == (551, 15)
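
As a reminder of how reading CSV data works in general, here is a minimal sketch on made-up data. The CSV text, the column names, and the variable `toy` are all hypothetical and are not part of the survey; an in-memory string stands in for a file or URL.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file path or URL
csv_text = "id,answer\n1,Yes\n2,No\n3,Yes\n"

# pd.read_csv accepts a path, a URL, or a file-like object
toy = pd.read_csv(io.StringIO(csv_text))

# .shape is a (rows, columns) tuple
print(toy.shape)
```

The first line of the CSV is treated as the header by default, so it does not count toward the number of rows.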

These data contain survey responses from Americans who responded to a SurveyMonkey Audience poll. These data were used in the [FiveThirtyEight](https://fivethirtyeight.com) article: *[How Americans Like Their Steak](https://fivethirtyeight.com/features/how-americans-like-their-steak/)*

# Part II: Wrangling

**Write a line of code to look at the first few rows of the DataFrame**

In [None]:
survey.head()

What do you notice about the first row of the DataFrame? It's not actually an observation from a respondent. **Remove this row from the dataset and assign the result back to the variable `survey`. Print the first few rows again to make sure you've accomplished this.**

In [None]:
## YOUR CODE HERE 

In [None]:
assert survey.shape == (550, 15)
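
One possible way to drop a leading non-observation row is sketched below on a made-up DataFrame. The data and the name `toy` are hypothetical, and position-based slicing is just one of several approaches that would work.

```python
import pandas as pd

# Hypothetical data: the first row holds leftover text, not a response
toy = pd.DataFrame({
    "id": [None, 1.0, 2.0],
    "q1": ["Response", "Yes", "No"],
})

# Keep every row except the first, then renumber the index from 0
toy = toy.iloc[1:].reset_index(drop=True)

print(toy.head())
```

Resetting the index is optional; without it, the remaining rows simply keep their original labels, starting at 1.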

Notice that there are a lot of different questions that were asked of respondents (columns) and 550 people who responded to the survey (rows). 

**Print a list of all the column names in this DataFrame.**

We'll only end up working with a subset of these.

In [None]:
## YOUR CODE HERE

Now we have a sense of what information is included in the dataset. In the coming weeks, we'll answer the following questions:
1. Who cheats more on their significant other - males or females?
2. Are cigarette smokers less likely to skydive?
3. Do people in New England gamble more than people in other parts of the country?

To answer these we'll only need data from *some* of the columns in the dataset.

Let's drop the columns we don't need. **Drop the first two columns from the dataset. This should still be assigned to the variable `survey`.**

In [None]:
## YOUR CODE HERE

In [None]:
assert survey.shape == (550, 13)
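
Listing column names and dropping columns by position can be sketched as follows on a made-up DataFrame. The column names and the variable `toy` are hypothetical; this is one approach among several.

```python
import pandas as pd

# Hypothetical data with two leading columns we do not need
toy = pd.DataFrame({
    "id": [1, 2],
    "timestamp": ["a", "b"],
    "q1": ["Yes", "No"],
    "q2": ["No", "No"],
})

# Column names as a plain Python list
print(list(toy.columns))

# Select the first two column labels by position and drop them
toy = toy.drop(columns=toy.columns[:2])

print(list(toy.columns))
```

Note that `drop` returns a new DataFrame by default, so the result has to be assigned back to the variable.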

Now that we've got the columns we want, let's clean up those column names. **Rename the columns in `survey` so the appropriate columns have the following names:**

* smoking
* alcohol
* gambling
* skydiving
* speeding
* cheated
* steak
* steak_preference
* gender
* age
* income 
* education
* region


In [None]:
## YOUR CODE HERE


In [None]:
assert list(survey) == ['smoking',
                        'alcohol',
                        'gambling',
                        'skydiving',
                        'speeding',
                        'cheated',
                        'steak',
                        'steak_preference',
                        'gender',
                        'age',
                        'income',
                        'education',
                        'region']
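
Two common ways to rename columns are sketched below on a made-up DataFrame. The original column names here are hypothetical stand-ins and are not copied from the survey.

```python
import pandas as pd

# Hypothetical data with long, question-style column names
toy = pd.DataFrame({
    "Do you ever smoke?": ["Yes", "No"],
    "Your age": ["18-29", "30-44"],
})

# Option 1: map old names to new names; unlisted columns are left unchanged
renamed = toy.rename(columns={
    "Do you ever smoke?": "smoking",
    "Your age": "age",
})

# Option 2: assign a full list of names, in column order
toy.columns = ["smoking", "age"]

print(list(renamed), list(toy))
```

The mapping form is safer when only a few columns change; the list form is shorter when every column gets a new name, but the list must match the column order and count exactly.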

We're in pretty good shape now. **Print the first few rows of the `survey` DataFrame to see what the data look like at this point.**

In [None]:
survey.head()

# Part III: Cleaning

**Now that we've got the data we need, let's get a sense of how much missing data there is in this dataset by determining how many rows in `survey` contain at least one null value. Assign this value to the variable `null_rows`.**

In [None]:
## YOUR CODE HERE

In [None]:
assert null_rows == 217
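
Counting rows that contain at least one missing value can be sketched like this on a made-up DataFrame. The data and the names `toy`, `has_null`, and `n_null_rows` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 is entirely missing, rows 2 and 3 partly missing
toy = pd.DataFrame({
    "q1": ["Yes", np.nan, "No", np.nan],
    "q2": ["No", np.nan, np.nan, "Yes"],
})

# isnull() marks each missing cell; any(axis=1) flags rows with at least one
has_null = toy.isnull().any(axis=1)

# True counts as 1, so the sum is the number of flagged rows
n_null_rows = int(has_null.sum())

print(n_null_rows)
```

Here `axis=1` means the check runs across the columns of each row, giving one True/False value per row.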

Good to know that lots of people didn't answer every question. We'll keep that in mind as we work with this dataset.

Simply dropping missing observations is typically not good practice; however, in this case we'll drop observations that have missing data across the entire row, as these are individuals who didn't participate in the survey at all. **Remove rows where ALL the columns have missing data for that participant.**

In [None]:
## YOUR CODE HERE

In [None]:
assert survey.shape == (541, 13)
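
The difference between dropping rows where every value is missing and dropping rows where any value is missing can be sketched on a made-up DataFrame. The data and variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 is entirely missing, rows 2 and 3 partly missing
toy = pd.DataFrame({
    "q1": ["Yes", np.nan, "No", np.nan],
    "q2": ["No", np.nan, np.nan, "Yes"],
})

# how="all" drops a row only if every column in it is missing
all_missing_dropped = toy.dropna(how="all")

# The default, how="any", drops a row if even one value is missing
any_missing_dropped = toy.dropna()

print(all_missing_dropped.shape, any_missing_dropped.shape)
```

With `how="all"`, partially answered rows are kept, which matches the goal of removing only people who did not participate at all.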

**Print the first few rows to remind yourself what the data look like at this point.**

In [None]:
## YOUR CODE HERE

Note that the first row no longer has all missing data here. We've got a dataset we can work with now!

**Great work on this workbook! We'll continue to work with this dataset in section to answer our questions of interest. You can work on A2 or help a classmate work through this - we always understand things best once we've had to explain them to someone else.**