# Data Collection and Cleaning
---

## Table of Contents
 #### 1.  [Getting data from the REST API](#getting_data)
 #### 2.  [Creating the intakes Data Frame](#intakes)
 #### 3.  [Creating the outcomes Data Frame](#outcomes)
 #### 4.  [Joining together](#joining)
 #### 5.  [Exporting the cleaned data](#exporting)
 ---

## 1. Getting data from the REST API <a id="getting_data"></a>

In [1]:
import numpy as np
import pandas as pd
import requests
from pandas import json_normalize  # top-level import; the pandas.io.json path is deprecated
from geopy.geocoders import Nominatim
from sodapy import Socrata
from pygeocoder import Geocoder
import warnings
warnings.filterwarnings('ignore')

## 2. Creating the intakes Data Frame <a id="intakes"></a>

In [None]:
client = Socrata("data.austintexas.gov", None)
results = client.get("fdzn-9yqv", limit=100000)

In [3]:
intake_df = pd.DataFrame.from_records(results)
intake_df = intake_df.set_index("animal_id")
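
To illustrate what the two lines above produce, here is a minimal sketch that skips the network call. It assumes the query returns a list of dictionaries (one per row); the records in `sample_results` are made up for illustration and are not real shelter data.

```python
import pandas as pd

# Made-up records shaped like a list of row dictionaries (assumption:
# this mirrors what the API client returns); values are illustrative only.
sample_results = [
    {"animal_id": "A000001", "animal_type": "Dog", "datetime": "2017-01-01T12:00:00.000"},
    {"animal_id": "A000002", "animal_type": "Cat", "datetime": "2017-01-02T09:30:00.000"},
    {"animal_id": "A000001", "animal_type": "Dog", "datetime": "2017-06-15T08:00:00.000"},
]

sample_df = pd.DataFrame.from_records(sample_results).set_index("animal_id")

# The index is not unique: a returning animal appears once per intake.
print(sample_df.index.is_unique)
print(sample_df.loc["A000001"])
```

Because `animal_id` can repeat, any later step that selects or drops by index label affects every row for that animal.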

#### I. Rename `datetime` and `sex_upon_intake` to clearer names


In [4]:
# Drop the datetime2 column, then rename datetime and sex_upon_intake
intake_df.drop(['datetime2'], axis=1, inplace=True)
intake_df.rename(columns={'datetime': 'date_in', 'sex_upon_intake': 'sex'}, inplace=True)


#### II. Keep only dogs


In [5]:
# .copy() makes the filtered frame independent, so later column assignments are unambiguous
intake_df = intake_df.loc[intake_df['animal_type'] == "Dog"].copy()

#### III. Convert `color` to a list


In [6]:
intake_df.color = intake_df.color.str.split('/')
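
As a small illustration of the split, the sketch below applies the same call to a few made-up color strings (the values are hypothetical):

```python
import pandas as pd

# Hypothetical color values; multi-color entries are separated by "/".
colors = pd.Series(["Black/White", "Brown", "Tan/Black"])

# Each entry becomes a Python list, including single-color entries.
color_lists = colors.str.split("/")
print(color_lists.tolist())
```

Note that a column of lists is not hashable, which matters for operations such as de-duplication later on.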

#### IV. Standardize addresses to use later


In [7]:
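# Strip the 5-character suffix (assumed to be a state tag such as " (TX)") and the word "in"
intake_df['found_location'] = intake_df['found_location'].str[:-5].str.replace(" in ", " ", regex=False)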
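
The sketch below shows the effect of this transformation on two made-up addresses. It assumes every value ends with a five-character suffix such as `" (TX)"`; values that do not follow that pattern would lose their last five characters anyway.

```python
import pandas as pd

# Hypothetical addresses in the assumed "<street> in <city> (TX)" format.
locations = pd.Series([
    "123 Example St in Austin (TX)",
    "Austin (TX)",
])

# Drop the last five characters, then replace the literal " in " with a space.
cleaned = locations.str[:-5].str.replace(" in ", " ", regex=False)
print(cleaned.tolist())
```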

#### V. Create two separate columns, `fixed` and `sex` from `sex`


In [9]:
sex_series = intake_df.sex.str.split(" ")
intake_df['fixed'] = sex_series.str[0]
intake_df['sex'] = sex_series.str[1]
intake_df.fixed = intake_df.fixed.map({
    "Neutered": "Yes",
    "Spayed": "Yes",
    "Intact": "No"
})
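
A minimal sketch of the same split-and-map logic on made-up values, to show how single-word entries such as `"Unknown"` behave (they end up as missing values in both columns):

```python
import pandas as pd

# Hypothetical values for the combined sex column.
sex_upon_intake = pd.Series(["Neutered Male", "Spayed Female", "Intact Male", "Unknown"])

parts = sex_upon_intake.str.split(" ")
demo = pd.DataFrame({
    # Labels missing from the mapping (here "Unknown") become NaN.
    "fixed": parts.str[0].map({"Neutered": "Yes", "Spayed": "Yes", "Intact": "No"}),
    # Entries with only one word have no second element, so sex is NaN.
    "sex": parts.str[1],
})
print(demo)
```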

#### VI. Clean up the dogs' names


In [10]:
# regex=False treats "*" as a literal character rather than a regex quantifier
intake_df['name'] = intake_df['name'].str.replace("*", "", regex=False)
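
A short sketch of the name clean-up on made-up names. Passing `regex=False` makes the intent explicit, since `*` on its own is not a valid regular expression:

```python
import pandas as pd

# Hypothetical names; some carry a leading asterisk.
names = pd.Series(["*Buddy", "Max", "*Luna"])

# Remove the literal asterisk wherever it appears.
clean_names = names.str.replace("*", "", regex=False)
print(clean_names.tolist())
```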

#### VII. Convert `date_in` to `datetime` object


In [11]:
intake_df['date_in'] = pd.to_datetime(intake_df['date_in'])
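
For illustration, the sketch below converts a few made-up timestamp strings in one vectorized `pd.to_datetime` call. The ISO-style format of the strings is an assumption about how the API returns dates.

```python
import pandas as pd

# Hypothetical ISO-style timestamp strings.
raw = pd.Series(["2017-01-01T12:00:00.000", "2017-06-15T08:30:00.000"])

# Passing the whole Series parses every value at once.
parsed = pd.to_datetime(raw)
print(parsed.dtype)
print(parsed.dt.year.tolist())
```

Once the column holds datetimes, subtraction between date columns yields timedeltas, which the later `time_in_shelter` and age steps rely on.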

## 3. Creating the outcomes Data Frame <a id="outcomes"></a>

In [13]:
client = Socrata("data.austintexas.gov", None)
results = client.get("9t4d-g238", limit=100000)
outcomes_df = pd.DataFrame.from_records(results)
outcomes_df = outcomes_df.set_index("animal_id")



#### I. Rename and convert `datetime` to `date_out` as a `datetime` object

In [15]:
outcomes_df.rename(columns={'datetime': 'date_out'}, inplace=True)

In [16]:
outcomes_df['date_out'] = pd.to_datetime(outcomes_df['date_out'])

#### II. Remove unnecessary columns from `outcomes_df`

In [17]:
outcomes_df = outcomes_df[['date_of_birth', 'date_out', "outcome_subtype", "outcome_type"]]

## 4. Joining together <a id="joining"></a>

#### I. Outer merge `intake_df` and `outcomes_df`

In [18]:
combined_df = intake_df.merge(outcomes_df, on="animal_id", how="outer")
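
The sketch below, on made-up data, shows what an outer merge does when `animal_id` repeats on both sides: every intake for an animal is paired with every outcome for that animal, and animals present in only one frame get missing values. The IDs and dates are hypothetical.

```python
import pandas as pd

# Hypothetical intakes: A1 was admitted twice, B2 once.
intakes = pd.DataFrame(
    {"animal_id": ["A1", "A1", "B2"],
     "date_in": pd.to_datetime(["2017-01-01", "2017-06-01", "2017-03-01"])}
).set_index("animal_id")

# Hypothetical outcomes: A1 left twice, C3 has an outcome but no intake.
outcomes = pd.DataFrame(
    {"animal_id": ["A1", "A1", "C3"],
     "date_out": pd.to_datetime(["2017-01-10", "2017-06-05", "2017-04-01"])}
).set_index("animal_id")

merged = intakes.merge(outcomes, on="animal_id", how="outer")

# A1 contributes 2 x 2 = 4 rows, B2 and C3 one row each.
print(len(merged))
# C3 has no intake, so its date_in is missing.
print(merged["date_in"].isna().sum())
# One A1 pairing matches the later intake with the earlier outcome.
negative = (merged["date_out"] - merged["date_in"]) < pd.Timedelta(0)
print(negative.sum())
```

This pairing behavior is why rows without `date_in` and rows with negative durations need to be handled in the following steps.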

#### II. Drop incorrectly entered rows (rows that don't have `date_in`)

In [19]:
combined_df = combined_df.dropna(axis=0, subset=['date_in'])

#### III. Create column `in_shelter`, set to "Yes" when `outcome_type` is missing

In [20]:
combined_df['in_shelter'] = "No"
# Use a single .loc assignment; chained indexing may write to a temporary copy
combined_df.loc[combined_df.outcome_type.isnull(), 'in_shelter'] = "Yes"
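
One way to build this flag with a single `.loc` assignment, shown on made-up rows (the outcome values are hypothetical):

```python
import pandas as pd

# Hypothetical outcomes; B2 has no recorded outcome yet.
demo = pd.DataFrame(
    {"outcome_type": ["Adoption", None, "Transfer"]},
    index=pd.Index(["A1", "B2", "C3"], name="animal_id"),
)

# Default to "No", then flag rows without an outcome as still in the shelter.
demo["in_shelter"] = "No"
demo.loc[demo["outcome_type"].isnull(), "in_shelter"] = "Yes"
print(demo)
```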

#### IV. Create column `time_in_shelter` and remove invalid rows (negative times)

In [22]:
combined_df['time_in_shelter'] = combined_df.date_out - combined_df.date_in

In [25]:
mask = combined_df.time_in_shelter < pd.Timedelta(0)
# Boolean indexing removes only the flagged rows; drop() by label would remove every row sharing an animal_id
combined_df = combined_df[~mask]
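
A minimal sketch of filtering negative durations on made-up data with a repeated `animal_id`. It contrasts a boolean mask, which removes only the flagged rows, with dropping by index label, which removes every row that shares the label. Rows with a missing `date_out` compare as `False` and are kept in both cases.

```python
import pandas as pd

# Hypothetical rows: the second A1 row pairs a later intake with an earlier outcome.
demo = pd.DataFrame(
    {"date_in": pd.to_datetime(["2017-01-01", "2017-06-01", "2017-03-01"]),
     "date_out": pd.to_datetime(["2017-01-10", "2017-01-10", None])},
    index=pd.Index(["A1", "A1", "B2"], name="animal_id"),
)
demo["time_in_shelter"] = demo["date_out"] - demo["date_in"]

mask = demo["time_in_shelter"] < pd.Timedelta(0)

# Boolean indexing keeps the valid A1 row and the still-open B2 row.
kept = demo[~mask]
print(kept.index.tolist())

# Dropping by label removes both A1 rows, including the valid one.
dropped_by_label = demo.drop(mask[mask].index)
print(dropped_by_label.index.tolist())
```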

#### VI. Create `age_in` and `age_out` from the dog's birthday

In [28]:
combined_df['date_of_birth'] = pd.to_datetime(combined_df['date_of_birth'])

In [29]:
combined_df["age_in"] = combined_df.date_in - combined_df.date_of_birth
# Whole years, with ages that round down to 0 reported as 0.5
combined_df.age_in = (combined_df.age_in.dt.days / 365).round().replace(0.0, 0.5)

In [30]:
combined_df["age_out"] = combined_df.date_out - combined_df.date_of_birth
combined_df.age_out = (combined_df.age_out.dt.days / 365).round().replace(0.0, 0.5)
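
To make the age arithmetic concrete, the sketch below works through two made-up dogs: one about two years old and one about three months old. The dates are hypothetical.

```python
import pandas as pd

demo = pd.DataFrame({
    "date_of_birth": pd.to_datetime(["2015-01-01", "2016-10-01"]),
    "date_in": pd.to_datetime(["2017-01-01", "2017-01-01"]),
})

# Age in days: 731 for the first dog, 92 for the second.
age_days = (demo["date_in"] - demo["date_of_birth"]).dt.days

# 731 / 365 rounds to 2.0; 92 / 365 rounds to 0.0, which is replaced by 0.5.
demo["age_in"] = (age_days / 365).round().replace(0.0, 0.5)
print(demo["age_in"].tolist())
```

The result is a coarse age in years, where 0.5 acts as a stand-in for any dog younger than roughly six months.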

#### VII. Remove duplicate rows (from user entry error)

In [31]:
features = combined_df.columns.tolist()
features.remove("color")
features.remove("date_out")
features.remove("outcome_type")
combined_df = combined_df.drop_duplicates(subset=features)
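
A small sketch of de-duplicating on a subset of columns, using made-up rows. Columns left out of `subset` are ignored when rows are compared, and the index is not considered either; the first matching row is kept.

```python
import pandas as pd

# Hypothetical rows: the two A1 rows differ only in the color list.
demo = pd.DataFrame(
    {"name": ["Buddy", "Buddy", "Luna"],
     "breed": ["Beagle", "Beagle", "Poodle"],
     "color": [["Black", "White"], ["Black"], ["Tan"]]},
    index=pd.Index(["A1", "A1", "B2"], name="animal_id"),
)

# Compare rows on every column except the list-valued one.
features = [col for col in demo.columns if col != "color"]
deduped = demo.drop_duplicates(subset=features)
print(deduped)
```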

In [32]:
combined_df.drop(['age_upon_intake', 'animal_type'], axis=1, inplace=True)

#### VIII. Create a new Data Frame with one row per dog (first record only)

In [33]:
combined_unique_df = combined_df[~combined_df.index.duplicated(keep='first')]
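
The sketch below shows how `index.duplicated(keep='first')` behaves on made-up rows. "First" refers to row order in the frame, so the earliest intake is kept only if the rows are already in chronological order.

```python
import pandas as pd

# Hypothetical rows: A1 appears twice.
demo = pd.DataFrame(
    {"date_in": pd.to_datetime(["2017-01-01", "2017-06-01", "2017-03-01"])},
    index=pd.Index(["A1", "A1", "B2"], name="animal_id"),
)

# Keep only the first row seen for each animal_id.
unique_demo = demo[~demo.index.duplicated(keep="first")]
print(unique_demo)
```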

## 5. Exporting the cleaned data <a id="exporting"></a>

In [36]:
combined_df.to_csv('austin_shelter.csv')

In [37]:
combined_unique_df.to_csv('unique_austin_shelter.csv')
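
When the exported files are read back, CSV does not preserve column types. The sketch below round-trips a made-up frame through an in-memory buffer (instead of a file on disk) to show that dates need `parse_dates` and that list-valued columns come back as plain strings.

```python
import io

import pandas as pd

# Hypothetical single-row frame with a datetime and a list column.
demo = pd.DataFrame(
    {"date_in": pd.to_datetime(["2017-01-01 12:00:00"]),
     "color": [["Black", "White"]]},
    index=pd.Index(["A1"], name="animal_id"),
)

buffer = io.StringIO()
demo.to_csv(buffer)
buffer.seek(0)

# Restore the index and ask pandas to parse the date column again.
restored = pd.read_csv(buffer, index_col="animal_id", parse_dates=["date_in"])
print(restored.dtypes)
print(type(restored["color"].iloc[0]))
```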

---

### Get geocodes from addresses

*NOTE: DO NOT RUN THESE CELLS*