# Wrangle Exercises
---
## Exercises I
Let's review the steps we take at the beginning of each new module.

1. Create a new repository named `regression-exercises` in your GitHub; all of your Regression work will be housed here.
2. Clone this repository within your local `codeup-data-science` directory.
3. Create a `.gitignore` and make sure your list of 'files to ignore' includes your `env.py` file.
4. Create a `README.md` file that outlines the contents and purpose of your repository.
5. Add, commit, and push these two files.
6. Now you can add your `env.py` file to this repository to access the Codeup database server.
7. For these exercises, you will create `wrangle.ipynb` and `wrangle.py` files to hold necessary functions.
8. As always, add, commit, and push your work often.
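
Step 3 is easy to get wrong, and a committed `env.py` leaks database credentials. As a quick sanity check before the first push, a small helper along these lines could confirm that `env.py` is listed in `.gitignore`. The function name `is_ignored` is hypothetical, and the check is a plain line match; it does not evaluate wildcard patterns the way Git does.

```python
from pathlib import Path

def is_ignored(filename, gitignore_path='.gitignore'):
    """Return True if `filename` appears as its own line in the .gitignore file."""
    path = Path(gitignore_path)
    if not path.is_file():
        # no .gitignore at all means nothing is being ignored
        return False
    # strip whitespace, then skip blank lines and comments
    entries = [line.strip() for line in path.read_text().splitlines()]
    entries = [e for e in entries if e and not e.startswith('#')]
    # accept both 'env.py' and the root-anchored form '/env.py'
    return filename in entries or '/' + filename in entries

# expected to be True once step 3 is done
print(is_ignored('env.py'))
```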
---
## Exercises II
Let's set up an example scenario to give context to our regression exercises using the Zillow dataset.

As a Codeup data science graduate, you want to show off your skills to the Zillow data science team in hopes of getting an interview for a position you saw pop up on LinkedIn. You thought it might look impressive to build an end-to-end project in which you use some of their Kaggle data to predict property values from the available features; who knows, you might even do some feature engineering to blow them away. Your goal is to predict the values of single-unit properties using the observations from 2017.

In these exercises, you will complete the first step toward the above goal: acquire and prepare the necessary Zillow data from the `zillow` database on the Codeup database server.

1. Acquire `bedroomcnt`, `bathroomcnt`, `calculatedfinishedsquarefeet`, `taxvaluedollarcnt`, `yearbuilt`, `taxamount`, and `fips` from the `zillow` database for all 'Single Family Residential' properties.

In [1]:
# import pandas
import pandas as pd
# function to write url for sql database
def get_url(db):
    from env import user, password, host
    return f'mysql+pymysql://{user}:{password}@{host}/{db}'
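
One caveat with building the connection string this way: if the password contains characters such as `@`, `/`, or `:`, the resulting URL is ambiguous and may fail to parse. A minimal sketch of one workaround, percent-encoding the credentials with the standard library, is shown below. The function name `build_url` and the sample credentials are made up for illustration; the real values would still come from `env.py`.

```python
from urllib.parse import quote_plus

def build_url(user, password, host, db):
    """Build a mysql+pymysql connection URL with percent-encoded credentials."""
    return f'mysql+pymysql://{quote_plus(user)}:{quote_plus(password)}@{host}/{db}'

# made-up credentials, for illustration only
print(build_url('some_user', 'p@ss/word', 'db.example.com', 'zillow'))
# prints mysql+pymysql://some_user:p%40ss%2Fword@db.example.com/zillow
```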

In [3]:
# get url for zillow database
url = get_url('zillow')
# sql query to acquire data
sql = '''
SELECT bedroomcnt, 
       bathroomcnt,
       calculatedfinishedsquarefeet,
       taxvaluedollarcnt,
       yearbuilt,
       taxamount,
       fips
FROM properties_2017
WHERE propertylandusetypeid = 261;
'''
# assign sql query result (dataframe) to variable
zillow = pd.read_sql(sql, url)
zillow.head()

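|   | bedroomcnt | bathroomcnt | calculatedfinishedsquarefeet | taxvaluedollarcnt | yearbuilt | taxamount | fips |
|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | NaN | 27516.0 | NaN | NaN | 6037.0 |
| 1 | 0.0 | 0.0 | NaN | 10.0 | NaN | NaN | 6037.0 |
| 2 | 0.0 | 0.0 | NaN | 10.0 | NaN | NaN | 6037.0 |
| 3 | 0.0 | 0.0 | NaN | 2108.0 | NaN | 174.21 | 6037.0 |
| 4 | 4.0 | 2.0 | 3633.0 | 296425.0 | 2005.0 | 6941.39 | 6037.0 |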


In [6]:
# make function to acquire/cache data
def acquire_zillow():
    '''
    This function takes no arguments and returns a dataframe of 2017 Single Family 
    Residential property data from Zillow. It searches for a csv file (zillow.csv)
    with the requested data and reads that file into a dataframe. If the csv
    file is not found, it retrieves the SQL query result and reads it into a 
    dataframe. It then caches this data into a csv file (zillow.csv).
    '''
    import os
    if os.path.isfile('zillow.csv'):
        zillow = pd.read_csv('zillow.csv', index_col=0)
        return zillow
    else:
        from env import user, password, host
        url = f'mysql+pymysql://{user}:{password}@{host}/zillow'
        sql = '''
        SELECT bedroomcnt, 
               bathroomcnt,
               calculatedfinishedsquarefeet,
               taxvaluedollarcnt,
               yearbuilt,
               taxamount,
               fips
        FROM properties_2017
        WHERE propertylandusetypeid = 261;
        '''
        zillow = pd.read_sql(sql, url)
        zillow.to_csv('zillow.csv')
        return zillow
# check that function works
zillow = acquire_zillow()
zillow.head()

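|   | bedroomcnt | bathroomcnt | calculatedfinishedsquarefeet | taxvaluedollarcnt | yearbuilt | taxamount | fips |
|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | NaN | 27516.0 | NaN | NaN | 6037.0 |
| 1 | 0.0 | 0.0 | NaN | 10.0 | NaN | NaN | 6037.0 |
| 2 | 0.0 | 0.0 | NaN | 10.0 | NaN | NaN | 6037.0 |
| 3 | 0.0 | 0.0 | NaN | 2108.0 | NaN | 174.21 | 6037.0 |
| 4 | 4.0 | 2.0 | 3633.0 | 296425.0 | 2005.0 | 6941.39 | 6037.0 |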


2. Using your acquired Zillow data, walk through the summarization and cleaning steps in your `wrangle.ipynb` file like we did above. You may handle the missing values however you feel is appropriate and meaningful; remember to document your process and decisions using markdown and code commenting where helpful.

In [7]:
zillow.shape

(2152863, 7)
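
Beyond `shape`, the usual first look includes column dtypes, null counts, and summary statistics. The sketch below runs those steps on the five preview rows printed earlier, typed in by hand so the example runs without a database connection; the variable names `sample` and `cleaned` are illustrative. Dropping rows with nulls is only one option, and on the full dataset you would want to check how many rows it removes before committing to it.

```python
import numpy as np
import pandas as pd

# the five rows shown by zillow.head(), entered by hand
sample = pd.DataFrame({
    'bedroomcnt': [0.0, 0.0, 0.0, 0.0, 4.0],
    'bathroomcnt': [0.0, 0.0, 0.0, 0.0, 2.0],
    'calculatedfinishedsquarefeet': [np.nan, np.nan, np.nan, np.nan, 3633.0],
    'taxvaluedollarcnt': [27516.0, 10.0, 10.0, 2108.0, 296425.0],
    'yearbuilt': [np.nan, np.nan, np.nan, np.nan, 2005.0],
    'taxamount': [np.nan, np.nan, np.nan, 174.21, 6941.39],
    'fips': [6037.0] * 5,
})

# summarize: dtypes and non-null counts, then nulls per column
sample.info()
null_counts = sample.isnull().sum()
print(null_counts)

# share of rows that have at least one null
pct_rows_with_nulls = sample.isnull().any(axis=1).mean()
print(f'{pct_rows_with_nulls:.0%} of rows have at least one null')

# one simple cleaning choice: drop every row that has a null
cleaned = sample.dropna()
print(cleaned.shape)
```

On this tiny sample only the last row survives `dropna()`, which shows why the proportion of affected rows is worth checking first.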

3. Store all of the necessary functions to automate your process from acquiring the data to returning a cleaned dataframe with no missing values in your `wrangle.py` file. Name your final function `wrangle_zillow`.

In [None]:
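def wrangle_zillow():
    '''
    This function takes in no arguments and returns a clean dataframe of Single Family
    Residential property data from Zillow.
    '''
    # acquire the data (from zillow.csv if cached, otherwise from the database)
    zillow = acquire_zillow()
    # one simple way to guarantee no missing values: drop any row that has a null
    zillow = zillow.dropna()
    return zillow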