# Module 1 - Introduction to Pandas
## Pandas Part 1

### Introduction

![austin](http://www.austintexas.gov/sites/default/files/aac_logo.jpg)
You have decided that you want to start your own animal shelter, but you want to get an idea of what that will entail and get more information about planning. 

You have found out that Austin has one of the largest no-kill animal shelters in the country, and they keep meticulous track of animals that have been taken in and released. 

However, there are challenges:
- it is a [large file](https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Intakes/wter-evkm)
- the online visualization tools provided are terrible
- the data is stored as strings
- the file holds an overwhelming amount of information. Is there an easy way to look at this data? Can we do this with base Python? Is there a better way?


#### _Our goals today are to be able to_: <br/>

- Import/read data using Pandas
- Identify Pandas objects and manipulate Pandas objects by index and columns
- Filter data using Pandas

We will do this with the Austin data and with an animal-related dataset from NYC.

### Activation:

<img src="https://cdn-images-1.medium.com/max/1600/1*9IU5fBzJisilYjRAi-f55Q.png" width="700" height="700">  




- The data manipulation capabilities of pandas are built on top of the NumPy library.
- A pandas DataFrame object represents a spreadsheet-like table with cell values, column names, and row index labels.

### _Big questions for this lesson_: Why use Pandas? 
 
 (a) It provides methods for analyzing data stored in the formats data scientists most often encounter (.csv, .tsv, or .xlsx). 
 
 (b) It makes it very convenient to load, process, and analyze data in those formats. 
 
 (c) Together with Python visualization packages, it allows for visual analysis of tabular data.


### Qualities of a pandas DataFrame
- The data structures in Pandas are implemented using the `Series` and `DataFrame` classes.  

- A Series is a one-dimensional indexed array of some fixed data type.  
- A DataFrame is a two-dimensional, table-like data structure in which each column contains data of the same type.

- DataFrames are great for representing real data: rows correspond to instances (examples, observations, etc.), and columns correspond to features of these instances.
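To make the Series/DataFrame distinction concrete, here is a minimal sketch. The pet names match the ones used later in this lesson; the weights are made-up values added only so that the columns have different dtypes.

```python
import pandas as pd

# A Series: one-dimensional, labeled, with a single dtype.
ages = pd.Series([4, 2, 3], index=["a", "b", "c"], name="age")

# A DataFrame: two-dimensional; each column is a Series with its own dtype.
# The weight_kg values are invented for illustration.
pets = pd.DataFrame(
    {
        "name": ["Samantha", "Alex", "Dante"],
        "age": [4, 2, 3],
        "weight_kg": [4.5, 20.0, 31.2],
    }
)

print(ages.dtype)         # dtype of the single Series
print(pets.dtypes)        # one dtype per column
print(type(pets["age"]))  # selecting one column returns a Series
```

Each row of `pets` is one observation (a pet), and each column is one feature of that observation.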

### What are the **_disadvantages_** of using Pandas?<br>                    
https://wesmckinney.com/blog/apache-arrow-pandas-internals/

When do we want to use NumPy versus Pandas?
- What are the advantages of using Pandas?    
https://stackabuse.com/beginners-tutorial-on-the-pandas-python-library/

### 1. Importing and reading data with Pandas!

#### Let's use pandas to read some csv files so we can interact with them.

In [1]:
# First, let's check which directory we are in so the files we expect to see are there.
!pwd #or chdir
!ls -al

/home/will/DS_inclass/dc-ds-071519/1-Module/1-Section/day_5_lecture_1_pandas
total 116
drwxr-xr-x 3 will will  4096 Jul 19 13:08 .
drwxr-xr-x 6 will will  4096 Jul 19 13:07 ..
-rw-r--r-- 1 will will    62 Jul 19 13:07 example1.csv
-rw-r--r-- 1 will will 63117 Jul 19 13:07 excelpic.jpg
-rw-r--r-- 1 will will 24833 Jul 19 13:07 intro_to_pandas.ipynb
drwxr-xr-x 2 will will  4096 Jul 19 13:08 .ipynb_checkpoints
-rw-r--r-- 1 will will   238 Jul 19 13:07 made_up_jobs.csv
-rw-r--r-- 1 will will  2471 Jul 19 13:07 map_zip_nyc_hood.csv


In [6]:
import pandas as pd

pd.set_option("display.precision", 2)

Getting help with a function:

In [3]:
pd.DataFrame?

In [7]:
pd.DataFrame()

### Getting data into pandas

Besides `read_csv`, there is also `read_excel` and many other pandas `read_*` functions.  
http://pandas.pydata.org/pandas-docs/stable/user_guide/io.html
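As a small sketch of how `read_csv` behaves, the example below reads CSV text from memory instead of from a file. The text is an assumption: it imitates a file that was saved together with its row index, which is one common reason a column named `Unnamed: 0` shows up after loading.

```python
import io

import pandas as pd

# Made-up CSV text; the header row starts with an empty column name.
csv_text = ",Title1,Title2,Title3\n0,one,two,three\n1,example1,example2,example3\n"

# read_csv accepts a path, a URL, or any file-like object.
raw = pd.read_csv(io.StringIO(csv_text))
print(list(raw.columns))  # the unnamed first column is labeled 'Unnamed: 0'

# index_col=0 tells pandas to use that first column as the row index instead.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(clean)
```

Whether to pass `index_col=0` depends on how the file was written, so it is worth checking `head()` after loading.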

In [8]:
example_csv=pd.read_csv('example1.csv')
example_csv.head()

Unnamed: 0,Title1,Title2,Title3
0,one,two,three
1,example1,example2,example3


You can also load in data by using the url of an associated dataset.

In [9]:
shelter_data=pd.read_csv('https://data.austintexas.gov/api/views/9t4d-g238/rows.csv?accessType=DOWNLOAD') 
#this link is copied directly from the download option for CSV


In [None]:
shelter_data.head()

Now that we can read in data, let's get more comfortable with our Pandas data structures.

In [None]:
type(shelter_data)

In [None]:
# Now that the data is read, let's look at its shape.
shelter_data.shape

In [None]:
# What are the names of the columns?
shelter_data.columns

In [None]:
# What are the different data types present in our data?
shelter_data.info()

In [None]:
shelter_data.dtypes

In [None]:
# We can find the type of a particular column in a DataFrame in this way.
ID_series=shelter_data['Animal ID'] 
shelter_data['Animal ID'].dtypes

### 2. Utilizing and identifying Pandas objects

- What is a DataFrame object and what is a Series object? 
- How are they different from Python lists?

These are questions we will cover in this section. To begin, let's define this list of dogs.

In [None]:
#define your list here!

dogs = ['bulldog','labs','great dane','shih tzu','bull terrier']

print(dogs)

Using our list of dogs, we can create a pandas object called a 'series' which is much like an array or a vector.

In [None]:
dogs_series = pd.Series(dogs)

print(dogs_series)
type(dogs_series)

One difference between Python **list objects** and pandas **Series objects** is that you can define the index manually for a **Series object**.

In [None]:
ind = ['a','b','c','d','e']

dogs_series = pd.Series(dogs,index=ind)

print(dogs_series)
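A custom index changes how items can be looked up. The sketch below uses a shortened, three-item version of the dogs list and shows that the same element is reachable by label or by position.

```python
import pandas as pd

# Shortened version of the dogs list, with a custom string index.
dogs_series = pd.Series(["bulldog", "labs", "great dane"], index=["a", "b", "c"])

by_label = dogs_series.loc["c"]      # look up by index label
by_position = dogs_series.iloc[2]    # look up by integer position
label_slice = dogs_series.loc["a":"b"]  # label slices include the end label

print(by_label, by_position)
print(label_slice)
```

A plain list only supports the positional lookup; the label lookup is what the index adds.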

### Other ways to make DataFrames

We can do a similar thing with Python **dictionaries**. This time, however, we will create a DataFrame object from a Python dictionary.

In [None]:
# Dictionary with list object in values
pet_dict = {
    'name' : ['Samantha', 'Alex', 'Dante'],
    'age' : ['4','2','3'],
    'animal' : ['cat', 'dog', 'dog']
}

pet_df = pd.DataFrame(pet_dict)

pet_df.head()

In [None]:
#to find data types of columns
pet_df.dtypes

### Data type conversion by columns
Let's change the data type of ages to int.

In [None]:
# We can also change a column's type, but the change has to make sense.
pet_df.age = pet_df.age.astype(int)

# Uncomment the line below and observe what happens when trying to convert the pets' names to int or float.
#pet_df.name = pet_df.name.astype(int)

# What happens when converting numeric to string?
#pet_df.age = pet_df.age.astype(str)

pet_df.dtypes

In [None]:
pet_df.name = pet_df.name.str.lower()
pet_df.head()
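The sketch below shows what "the change has to make sense" means in practice. It rebuilds a two-column version of the pets table, and the `mixed` Series is a made-up example. `pd.to_numeric` with `errors="coerce"` is shown as one possible alternative when some values cannot be parsed; it is not something the lesson above relies on.

```python
import pandas as pd

pet_df = pd.DataFrame({"name": ["Samantha", "Alex", "Dante"], "age": ["4", "2", "3"]})

# Strings that look like integers convert cleanly.
pet_df["age"] = pet_df["age"].astype(int)

# Names cannot be parsed as integers, so astype raises a ValueError.
conversion_failed = False
try:
    pet_df["name"].astype(int)
except ValueError as err:
    conversion_failed = True
    print("conversion failed:", err)

# Made-up Series with one unparseable value; coerce turns it into NaN.
mixed = pd.Series(["4", "two", "3"])
coerced = pd.to_numeric(mixed, errors="coerce")
print(coerced)
```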

### Custom index

We can also use a custom index for these items. For example, we might want them to be the individual pet ID numbers.

In [None]:
pet_ids = ['1111','1145','0096']

#Notice here we use pd.DataFrame not pd.Series as we did for a pandas series.
pet_df = pd.DataFrame(pet_dict,index=pet_ids)

pet_df.head()

Using Pandas, we can also rename column names.

In [None]:
pet_df.columns = ['NAME', 'AGE','ANIMAL']
pet_df.head()

Or, we can also change the column names using the rename function.

In [None]:
pet_df.rename(columns={'AGE': 'YEARS'})

In [None]:
# Notice what happens when we print pet_df: the column name is unchanged because rename returned a new DataFrame.

pet_df

In [None]:
# If you want the change to be applied to the DataFrame itself, use the option `inplace=True`.
pet_df.rename(columns={'AGE': 'YEARS'}, inplace=True)
pet_df.head()

Similarly, the `drop` method removes rows and columns from your DataFrame.

In [None]:
pet_df.drop(columns=['YEARS', 'ANIMAL'])

In [None]:
# Notice again what happens if we print pet_df.
pet_df

In [None]:
pet_df.drop(columns=['YEARS', 'ANIMAL'], inplace=True)
pet_df

If you want the change to be applied to the DataFrame itself, use the option `inplace=True`.

Every function has options. Let's read more about `drop` [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html)
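As a sketch of the same idea without `inplace=True`: `rename` and `drop` return new DataFrames, so reassigning the result to the same name has the same visible effect. The two-row table here is a trimmed-down stand-in for the pets data.

```python
import pandas as pd

pet_df = pd.DataFrame(
    {"NAME": ["Samantha", "Alex"], "AGE": [4, 2], "ANIMAL": ["cat", "dog"]}
)

# rename returns a new DataFrame; the original keeps its column names.
renamed = pet_df.rename(columns={"AGE": "YEARS"})
print(list(pet_df.columns))
print(list(renamed.columns))

# Reassigning the result is an alternative to inplace=True, and calls can be chained.
pet_df = pet_df.rename(columns={"AGE": "YEARS"}).drop(columns=["ANIMAL"])

# drop removes rows when it is given index labels instead of column names.
without_first = pet_df.drop(index=[0])
print(without_first)
```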

### 3. Filtering Data Using Pandas
There are several ways to grab particular data from a DataFrame. 
- Python lists allow for selection of data only through integer location. 
- You can use a single integer or slice notation to make the selection but NOT a list of integers.
- Dictionaries only allow selection with a single label. Slices and lists of labels are not allowed.

In [None]:
l=[1,2,3,4,5]
# Indexing a list with a list of integers raises a TypeError.
l[[0,5]]
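The selection limits of lists and dictionaries described above can be checked with plain Python. This is a small sketch; the `ages` dictionary is made up for illustration.

```python
values = [1, 2, 3, 4, 5]

first = values[0]      # a single integer works
middle = values[1:3]   # a slice works

# A list of integers does not work as a list index.
try:
    values[[0, 4]]
except TypeError as err:
    print("list:", err)

# A comprehension is the plain-Python way to pick several positions.
picked = [values[i] for i in [0, 4]]

# Dictionaries accept a single label only; a list of labels is not a valid key.
ages = {"Samantha": 4, "Alex": 2, "Dante": 3}
try:
    ages[["Samantha", "Alex"]]
except TypeError as err:
    print("dict:", err)
```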

### DataFrames can be indexed by column name (label) or row name (index) or by position.   
#### The `.loc` indexer is used for indexing by label.  
#### The `.iloc` indexer is used for indexing by integer position.

In [None]:
nyc_dogs = pd.read_csv('https://data.cityofnewyork.us/resource/nu7n-tubp.csv') 

In [None]:
nyc_dogs.head()

### Let's take a look at `.iloc`
#### `.iloc` takes slices based on index position.
#### `.iloc` stands for integer location, which should help you remember what it does.
#### `.iloc`[row , column]

In [None]:
#returns the first row
nyc_dogs.iloc[0] 

In [None]:
#returns the first column
nyc_dogs.iloc[:,0] 

In [None]:
# returns the first two rows; notice that iloc performs regular Python slicing (the stop is excluded).
nyc_dogs.iloc[0:2] 

In [None]:
#returns the first two columns
nyc_dogs.iloc[:,0:2] 

In [None]:
# returns the first row and the first two columns (positions 0 and 1)
nyc_dogs.iloc[0:1,0:2] 

### How would we use `.iloc` to return the last item in the last row?


In [None]:
#return the last item in the last row using iloc


### How would we use `.iloc` to return the last item in the last column?


In [None]:
#return the last item in the last column using iloc


### What if we only want certain columns or rows?

In [None]:
# Pass a list of positions. nyc_dogs.iloc[0, 2] would instead return the single value at row 0, column 2.
nyc_dogs.iloc[[0,2]]

In [None]:
nyc_dogs.iloc[[0,2],[0,2]] 
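The difference between `iloc[0, 2]` and `iloc[[0, 2]]` is easier to see on a tiny table. The sketch below uses a made-up four-row DataFrame that borrows column names from `nyc_dogs`; the values are invented.

```python
import pandas as pd

# Invented rows; only the column names mirror the NYC dataset.
dogs = pd.DataFrame(
    {
        "animalname": ["Sam", "Max", "Bella", "Lucy"],
        "animalgender": ["M", "M", "F", "F"],
        "zipcode": [10001, 11215, 10463, 11101],
    }
)

one_cell = dogs.iloc[0, 2]          # row 0, column 2: a single value
two_rows = dogs.iloc[[0, 2]]        # rows 0 and 2, all columns: a DataFrame
corner = dogs.iloc[[0, 2], [0, 2]]  # rows 0 and 2, columns 0 and 2

print(one_cell)
print(two_rows)
print(corner)
```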

### Let's take a look at `.loc`
#### Label based method. 
#### Names or labels of the index are used when taking slices.
#### Also supports boolean subsetting.

In [None]:
# We will use loc to return rows and columns based on labels. Let's look at the nyc_dogs DataFrame again.
nyc_dogs

In [None]:
#returns the dog information associated with index 0
nyc_dogs.loc[0]

In [None]:
#returns the dog information for row index 0 to 2 inclusive.
# Note that iloc uses normal Python slicing and would not include 2, as demonstrated above.
nyc_dogs.loc[0:2] 

In [None]:
#returns the column labeled 'animalname'
nyc_dogs.loc[:,'animalname'] 

In [None]:
# returns the values of the rows with index labels 1 to 2 (inclusive)
# for the column labeled 'animalname'
nyc_dogs.loc[1:2,'animalname'] 

In [None]:
# returns the values of the rows with index labels 1 to 2 (inclusive)
# for the columns labeled 'animalname' to 'zipcode' (inclusive)
nyc_dogs.loc[1:2,'animalname':'zipcode'] 

In [None]:
#What should we get?
nyc_dogs.loc[1:2,['animalname', 'zipcode']] 

In [None]:
#How about? 
nyc_dogs.loc[[0,2],['animalname', 'zipcode']] 
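With the default 0, 1, 2, ... index, labels and positions coincide, which can hide the difference between `.loc` and `.iloc`. The sketch below uses a made-up table with index labels 10, 20, 30, 40 so the two are easy to tell apart.

```python
import pandas as pd

# Invented rows with a non-default index so labels differ from positions.
dogs = pd.DataFrame(
    {
        "animalname": ["Sam", "Max", "Bella", "Lucy"],
        "zipcode": [10001, 11215, 10463, 11101],
    },
    index=[10, 20, 30, 40],
)

by_label = dogs.loc[10:30]     # labels 10, 20 and 30: the end label is included
by_position = dogs.iloc[0:2]   # positions 0 and 1: the stop is excluded
names = dogs.loc[[10, 40], "animalname"]  # a list of labels, one column: a Series

print(by_label)
print(by_position)
print(names)
```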

## Let's make a new column: age

In [None]:
nyc_dogs['age'] = 2019-nyc_dogs.animalbirth

In [None]:
nyc_dogs.head()

### Boolean Subsetting

In [None]:
nyc_dogs.loc[nyc_dogs['animalname']=='Sam']

In [None]:
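# The name column is 'animalname', as in the cell above; every column listed must exist in nyc_dogs.columns.
nyc_dogs.loc[nyc_dogs['animalname']=='Sam',['zipcode','state']]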

In [None]:
# What if we want to select dogs of a specific age?
nyc_dogs.loc[nyc_dogs['age']==2]

In [None]:
# What if we want to select dogs of a specific age and with a specific name?
nyc_dogs.loc[(nyc_dogs['age']==2) & (nyc_dogs['animalname']=='Max')]

In [None]:
#What should be returned? 
nyc_dogs.loc[(nyc_dogs['age']==2) & (nyc_dogs['animalgender']=='F')]
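Boolean subsetting works by building a Series of True/False values and passing it to `.loc`. The sketch below spells that out on a made-up four-row table that reuses the `nyc_dogs` column names; `|`, `~`, and `isin` are additional standard pandas operations shown for comparison with `&`.

```python
import pandas as pd

# Invented rows; only the column names mirror the lesson's DataFrame.
dogs = pd.DataFrame(
    {
        "animalname": ["Sam", "Max", "Bella", "Max"],
        "animalgender": ["M", "M", "F", "M"],
        "age": [2, 2, 5, 7],
    }
)

is_two = dogs["age"] == 2   # a boolean Series, one value per row
print(is_two.tolist())

# Each condition needs its own parentheses when combined with & (and) or | (or).
both = dogs.loc[is_two & (dogs["animalname"] == "Max")]
either = dogs.loc[is_two | (dogs["animalgender"] == "F")]
not_two = dogs.loc[~is_two]  # ~ negates the mask

# isin tests membership in a list; a column list after the comma limits the output.
named = dogs.loc[dogs["animalname"].isin(["Sam", "Bella"]), ["animalname", "age"]]
print(named)
```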

### Lesson Recap
Pandas combines the power of Python lists (selection via integer location) and dictionaries (selection by label).

`.iloc` is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array.

`.iloc` will raise IndexError if a requested indexer is out-of-bounds, except slice indexers which allow out-of-bounds indexing (this conforms with python/numpy slice semantics).

`.loc` is primarily label based, but may also be used with a boolean array.

#### Warning: Note that, contrary to usual Python slices, both the start and the stop are included when slicing with `.loc`.

`.loc` will raise a `KeyError` when any items are not found.
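The error behavior described in this recap can be checked with a short sketch on a made-up three-row DataFrame.

```python
import pandas as pd

dogs = pd.DataFrame({"animalname": ["Sam", "Max", "Bella"]})

# A single out-of-bounds position raises IndexError.
got_index_error = False
try:
    dogs.iloc[10]
except IndexError as err:
    got_index_error = True
    print("IndexError:", err)

# An out-of-bounds slice is simply truncated, as with Python lists.
tail = dogs.iloc[1:10]
print(len(tail))

# A label that is not in the index raises KeyError.
got_key_error = False
try:
    dogs.loc[10]
except KeyError as err:
    got_key_error = True
    print("KeyError:", err)
```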

### Pandas
- The data structures in Pandas are implemented using the `Series` and `DataFrame` classes.  
- A Series is a one-dimensional indexed array of some fixed data type.  
- A DataFrame is a two-dimensional, table-like data structure in which each column contains data of the same type.  
- DataFrames are great for representing real data: rows correspond to instances (examples, observations, etc.), and columns correspond to features of these instances.


### CLASS ASSIGNMENT
Now that we have all of these new tools in our tool belt, use these tools on the shelter data set! 
- Use `shelter_data.columns` to get the list of column names.
- Subset the data by `Outcome Subtype`.
- Subset the data by `Outcome Subtype` `Adoption` and only return the `Animal Type` column. 
- Subset the data by `Outcome Subtype` `Adoption` and only return the `Animal Type` column with only `Cat`. 
- Play around with your new tools on the data set.
- For extra credit: What are the data types returned from the different subsetting operations? Is what is returned a Series or a DataFrame?

In [29]:
import pandas as pd
shelter_data=pd.read_csv('https://data.austintexas.gov/api/views/9t4d-g238/rows.csv?accessType=DOWNLOAD') 
shelter_data.head()

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Age upon Outcome,Breed,Color
0,A800066,,07/19/2019 11:55:00 AM,07/19/2019 11:55:00 AM,07/15/2017,Transfer,Snr,Cat,Spayed Female,2 years,Domestic Shorthair,Black
1,A800048,,07/19/2019 11:54:00 AM,07/19/2019 11:54:00 AM,07/15/2018,Transfer,Snr,Cat,Intact Male,1 year,Domestic Shorthair,Gray/White
2,A796738,*Schmidt,07/19/2019 11:54:00 AM,07/19/2019 11:54:00 AM,05/04/2019,Adoption,Foster,Cat,Neutered Male,2 months,Domestic Shorthair Mix,Cream Tabby
3,A800058,,07/19/2019 11:54:00 AM,07/19/2019 11:54:00 AM,07/15/2017,Transfer,Snr,Cat,Unknown,2 years,Domestic Shorthair,White/Gray
4,A800025,,07/19/2019 11:54:00 AM,07/19/2019 11:54:00 AM,07/15/2017,Transfer,Snr,Cat,Intact Male,2 years,Domestic Medium Hair,Black/White


In [32]:
# The column name contains a space, so use bracket notation rather than attribute access.
outcomesubtypes=shelter_data['Outcome Subtype'].unique()

In [None]:
outcome_sub=shelter_data.loc[shelter_data['Outcome Subtype']=='Normal']
outcome_sub

## Assessment & Reflection

- One thing you did not know before?
- Two things you want to remember?
- One thing you're still confused by?

### EXTRA CREDIT

- Read in the csv `map_zip_nyc_hood.csv`
- create subsets (new datasets) of the dataset by borough 
- using only for loops, subsets, string operators, join, split, etc, create a unique list of zip codes by borough
- create a new column on the `nyc_dogs` DataFrame called 'borough', and use `if` statements and `in` logic to assign the new variable from your new lists.


**Question**: Using `shape` and filtering, how does the # of neutered vs un-neutered dogs differ by borough?


No *merging*, *joining*, *lambdas*, or *apply/map* functions. Those are for Monday :)