# Lecture 2: Pandas Introduction, DataFrame from CSV
Pandas is a popular Python package for data science, and with good reason: it offers powerful, expressive and flexible data structures that make data manipulation and analysis easy, among many other things. The DataFrame is one of these structures.

Pandas DataFrames make manipulating your data easy, from selecting or replacing columns and indices to reshaping your data. You won't have to write functions yourself to read and write data as we did in the first lecture.

The best introductory guide, although it is long, is 10 Minutes to Pandas: http://pandas.pydata.org/pandas-docs/stable/10min.html
The library's website is https://pandas.pydata.org. The documentation is available there as well.

This is also a good <a href="https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python?utm_source=adwords_ppc&utm_campaignid=898687156&utm_adgroupid=48947256715&utm_device=c&utm_keyword=&utm_matchtype=b&utm_network=g&utm_adpostion=1t1&utm_creative=229765585183&utm_targetid=dsa-473406573835&utm_loc_interest_ms=&utm_loc_physical_ms=9056017&gclid=Cj0KCQjwh8jrBRDQARIsAH7BsXcYZ9E60iAAQ9t8VhRS_TyfOtaM571VNV8EGKby9wgj9An4b1SZepgaAioSEALw_wcB"> tutorial on Pandas </a>
    

Let's start to work with pandas, to make analysis of this data easier.  Our convention is to import as "pd":

In [None]:
import pandas as pd

# numpy is a very useful Python library for numerical computation
import numpy as np

# two libraries for plotting
import matplotlib.pyplot as plt
import seaborn
%matplotlib inline

## Reading/Writing files<a name="1. Reading/Writing files"></a>
### Reading<a name="1.1 Reading"></a>

In this notebook we are going to use two different datasets: one with mostly qualitative information (`complaints`) and one with quantitative information (`bikes`).

In [None]:
# Pandas has built-in tools to read files, including csv and excel.
complaints = pd.read_csv('data/311-service-requests.csv')
bikes = pd.read_csv('data/bikes.csv', sep=';', encoding='latin1', parse_dates=['Date'], dayfirst=True, low_memory=False)

Depending on your pandas version, you might see a warning like "DtypeWarning: Columns (8) have mixed types". This means that pandas encountered a problem while reading our data. In this case it almost certainly means that some columns contain string entries alongside missing entries (NaN), which the function treats as floats.

For now we're going to ignore it and hope we don't run into a problem, but in the long run we'd need to investigate this warning.
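
One way to investigate such a warning later is to declare the type of the affected column explicitly with the `dtype` argument of `read_csv`. The sketch below uses a small made-up CSV; the column names `id` and `zip` and their values are hypothetical and are not taken from the 311 file.

```python
import io
import pandas as pd

# Hypothetical data: the "zip" column mixes digits, text and a missing entry.
csv_text = "id,zip\n1,10001\n2,ABC12\n3,\n"

# Reading the column as text avoids any guessing about its type.
df = pd.read_csv(io.StringIO(csv_text), dtype={"zip": str})

print(df["zip"].tolist())      # the missing entry is still NaN
print(df["zip"].isna().sum())  # number of missing entries
```

Missing entries stay NaN even when the column is read as text, so they can still be found with `isna()`.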



In [None]:
complaints
#bikes

Check the type of the data and see for yourself

In [None]:
# type of the results
type(complaints)

**The function returns a DataFrame, an object with many attributes and methods defined on it.**

### Writing<a name="1.2 Writing"></a>

*We can also save out a csv file from pandas in a simple way*


In [None]:
complaints.to_csv("data/saved_complaints.csv")

*What if we only wanted some of the columns? We can pick which ones to write out*


In [None]:
complaints.to_csv("data/saved_complaints2.csv", columns=["Created Date", "Closed Date"])

*If you don't want the extra index column (the row numbers) or the header row, you can prevent them from being saved by passing `index=False` and `header=False`*


In [None]:
complaints.to_csv("data/saved_complaints3.csv", columns=["Created Date", "Closed Date"], index=False, header=False)
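
To see what these options change without opening the saved file, here is a small sketch on a hypothetical stand-in for the complaints table (the rows are invented). It relies on the fact that `to_csv` returns the CSV text when no path is given.

```python
import pandas as pd

# Hypothetical stand-in for the complaints table.
df = pd.DataFrame({
    "Created Date": ["10/31/2013", "10/31/2013"],
    "Closed Date": ["11/01/2013", None],
    "Agency": ["NYPD", "DOT"],
})

# With no path, to_csv returns the CSV text instead of writing a file.
text = df.to_csv(columns=["Created Date", "Closed Date"], index=False)
lines = text.splitlines()
print(lines[0])  # header row, without the index column
print(lines[2])  # the missing Closed Date is written as an empty field
```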

## Viewing Data<a name="2. Viewing Data"></a>
### See the top & bottom rows of the frame<a name="2.1. See the top & bottom rows of the frame"></a>

`head()` shows the top rows of the table: 5 by default, or the number you pass inside the parentheses (3 here).

In [None]:
complaints.head(3)  
#bikes.head()

To show the bottom rows:

In [None]:
complaints.tail(3)
#bikes.tail()

### Plotting data
Since complaints contains mostly qualitative data, let's first use the bikes data to experiment with plotting and visualizing.

In [None]:
bikes.head()

To make a plot from pandas, we use the dataframe object we created, then tell the plot function
what to use as the X column and what to use as the Y column.

In [None]:
bikes.plot(x='Date', y='Maisonneuve 1', figsize=(10, 5))

## Selection
Let's now experiment with data selection.

*Let's use complaints data again*


In [None]:
complaints.head(3)

### Get a slice from a DataFrame
#### Get a specific column

**Selection by column name**

In [None]:
complaints["Agency"]

In [None]:
complaints.Agency

**If you want to select certain rows**

All rows

In [None]:
complaints.loc[:,'Agency']

Some rows

In [None]:
complaints.loc[1:3,'Agency']

**Selection by position**

In [None]:
# comment/uncomment lines
complaints.iloc[:,3]

#### Get a row

In [None]:
complaints.head(1)

In [None]:
# comment/uncomment lines
complaints.iloc[1,:]
#complaints.loc[1,:] #Selection by Label

Type of columns or rows

In [None]:
print(type(complaints["Agency"]))
print(type(complaints.iloc[1,:]))

**A single row or a single column selected this way is not a DataFrame but a Series. The methods available on the two types are not exactly the same.**

In [None]:
# It is possible to build a DataFrame from a Series
pd.DataFrame(complaints["Agency"]).head()
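
The difference between the two types can be checked on a tiny table. This is a minimal sketch with invented rows; `to_frame()` is shown as an alternative to calling `pd.DataFrame(...)` on a Series.

```python
import pandas as pd

df = pd.DataFrame({
    "Agency": ["NYPD", "DOT", "NYPD"],
    "Borough": ["BRONX", "QUEENS", "BROOKLYN"],
})

column = df["Agency"]           # single brackets: a Series
table = df[["Agency"]]          # double brackets: a one-column DataFrame
row = df.iloc[1, :]             # one row: a Series indexed by column names
same_table = column.to_frame()  # Series -> one-column DataFrame

print(type(column).__name__, type(table).__name__, type(row).__name__)
```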

#### Select multiple rows and columns

Sometimes you want to combine your selection to rows and columns.  You can do that with `.loc[]` and `.iloc[]`.

`iloc[rows, columns]` is for use with **number selectors** for row and column, and `loc[rows, columns]` is for **label selectors** for row and column (if they exist).

The documentation for this is [here](http://pandas.pydata.org/pandas-docs/stable/indexing.html).

If you use these, you must put a row selector first, then a column selector:

In [None]:
#Selecting by integer is done with iloc:
# this selects the first 10 rows and the first 3 columns.
complaints.iloc[0:10, 0:3]

Select rows 0 to 5, and the columns between the two named columns.
Note that this works with row numbers because the rows have no labels other than their numbers.
Notice that `loc` is inclusive of the end points of the range, meaning that it includes both of them.

In [None]:
complaints.loc[0:5, 'Closed Date':'Complaint Type']
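
The end-point behavior is easy to confuse, so here is a minimal sketch on a made-up table (the column names `a`, `b`, `c` are placeholders) comparing the two slicing styles:

```python
import pandas as pd

df = pd.DataFrame({
    "a": [10, 20, 30, 40],
    "b": [1, 2, 3, 4],
    "c": [5, 6, 7, 8],
})

by_position = df.iloc[0:2, 0:2]  # end excluded: rows 0-1, columns a-b
by_label = df.loc[0:2, "a":"b"]  # end included: rows 0-2, columns a-b

print(by_position.shape)
print(by_label.shape)
```

The same `0:2` slice yields two rows with `iloc` and three rows with `loc`.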

What if we just want to know the complaint type and the borough, but not the rest of the information? Pandas makes it really easy to select a subset of the columns: just index with a list of the columns you want, using double square brackets (or use the `loc[]` method above):

In [None]:
#complaints[['Complaint Type', 'Borough']]
complaints.loc[:,['Complaint Type', 'Borough']]

### Boolean Indexing
Boolean indexing is useful when we need to select data based on a certain criterion. For example, suppose we only want to select the rows where the value of the Unique Key column is greater than 26595140.

Check which rows satisfy the condition:

In [None]:
complaints['Unique Key'] > 26595140

In [None]:
complaints[complaints['Unique Key'] > 26595140].head()

Use the `isin()` function to select the rows that contain specific values in a specific column.
For example, this command selects all rows whose Complaint Type is either "Noise - Vehicle" or "Noise - Street/Sidewalk".

In [None]:
complaints[complaints['Complaint Type'].isin(['Noise - Vehicle','Noise - Street/Sidewalk'])]

**Using list of Booleans**

Check where the values in a specific column are equal to a specific value.

In [None]:
complaints['Complaint Type'] == 'Noise - Street/Sidewalk'

This is a big array of `True`s and `False`s, one for each row in our dataframe. When we index our dataframe with this array, we get just the rows where our boolean array evaluated to `True`. It's important to note that, for row filtering, the boolean array must have the same length as the dataframe's index.

In [None]:
complaints[complaints['Complaint Type'] == 'Noise - Street/Sidewalk' ]

You can also combine more than one condition with the `&` operator like this:

In [None]:
is_noise = complaints['Complaint Type'] == "Noise - Street/Sidewalk"
in_brooklyn = complaints['Borough'] == "BROOKLYN"
# now, both must be true since we use & here
# (prefixing the combined mask with ~ would instead select the rows where it is False):
complaints[is_noise & in_brooklyn].iloc[:10, :]

Or, to limit the columns we return:


In [None]:
complaints[is_noise & in_brooklyn][['Complaint Type', 'Borough', 'Created Date', 'Descriptor']]
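
Boolean masks can also be combined with `|` (or) and inverted with `~` (not). The sketch below uses four invented rows to show all three operators side by side.

```python
import pandas as pd

df = pd.DataFrame({
    "Complaint Type": ["Noise - Street/Sidewalk", "Noise - Vehicle",
                       "Noise - Street/Sidewalk", "Rodent"],
    "Borough": ["BROOKLYN", "BROOKLYN", "MANHATTAN", "BRONX"],
})
is_noise = df["Complaint Type"] == "Noise - Street/Sidewalk"
in_brooklyn = df["Borough"] == "BROOKLYN"

both = df[is_noise & in_brooklyn]         # and: both conditions hold
either = df[is_noise | in_brooklyn]       # or: at least one holds
neither = df[~(is_noise | in_brooklyn)]   # not: neither holds

print(len(both), len(either), len(neither))
```

When the conditions are written inline rather than stored in variables, each comparison needs its own parentheses, because `&` and `|` bind more tightly than `==`.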


## Get information about data
### Display the index, columns, and the underlying numpy data

In [None]:
complaints.head(1)

In [None]:
complaints.index

To get the index of the rows that satisfy a certain condition, use `.index`:

In [None]:
complaints[is_noise & in_brooklyn][['Complaint Type', 'Borough', 'Created Date', 'Descriptor']].index

In [None]:
complaints.columns

In [None]:
complaints.values

To get the types of the data columns in the dataframe

In [None]:
complaints.dtypes

### Describe data
The `describe()` function computes summary statistics for the numerical columns of the DataFrame:

 - count: number of non-missing values
 - mean: average
 - std: standard deviation
 - min: minimum
 - 25%, 50%, 75%: quartiles (50% is the median)
 - max: maximum
 

In [None]:
bikes.describe()
# Only the numeric columns
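
The result of `describe()` is itself a DataFrame, so individual statistics can be read back with `.loc`. A minimal sketch on a made-up table (the `station` and `riders` columns are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "station": ["A", "B", "C", "D"],
    "riders": [10, 20, 30, 40],
})

summary = df.describe()   # the text column "station" is left out
print(list(summary.index))
print(summary.loc["mean", "riders"])
```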

*Now we can see the min and max, and check if we were right!*

In [None]:
# We define a new function get_highest for DataFrame
def get_highest(data, name_column):
    # start below any possible value so that negative columns work too
    highest = float('-inf')
    for value in data.loc[:, name_column]:
        if float(value) > highest:
            highest = float(value)
    return highest

In [None]:
get_highest(bikes, 'Maisonneuve 1')
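
For comparison, the built-in `max()` gives the same answer without a loop, and `idxmax()` tells us where the maximum occurs. This is a sketch on invented counts, not on the real bikes file:

```python
import numpy as np
import pandas as pd

# Hypothetical daily counts with one missing day.
counts = pd.Series([35.0, np.nan, 120.0, 83.0], name="Maisonneuve 1")

highest = counts.max()   # NaN values are skipped by default
where = counts.idxmax()  # index label of the row holding the maximum

print(highest, where)
```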

#### Plotting histograms

In [None]:
bikes.hist(figsize = (20,10))

In [None]:
# histogram of a specific column
bikes['Maisonneuve 1'].hist()

### Simple stats on Numerical columns

In [None]:
bikes['Côte-Sainte-Catherine'].sum()

In [None]:
bikes['Côte-Sainte-Catherine'].mean()

In [None]:
bikes['Côte-Sainte-Catherine'].max()

In [None]:
bikes['Côte-Sainte-Catherine'].min()

In [None]:
bikes['Côte-Sainte-Catherine'].median()

## Reorganizing data
### Transposing data 

In [None]:
bikes.T

### Sorting data

In [None]:
bikes.sort_values(by = 'Berri 1', ascending=False).head()
# add arg inplace = True to perform operation in-place

In [None]:
# it's possible to sort a column with the same function
bikes['du Parc'].sort_values().head()

## Transforming data
### Transforming existing data
*To avoid modifying the original DataFrames, we first make copies*


In [None]:
complaints_copy = complaints.copy()
# df_copy = df only creates an alias, not a new object!
bikes_copy = bikes.copy()
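
The difference between an alias and a copy can be verified on a toy table. A minimal sketch, with a made-up column `a`:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
alias = df                # same object under a second name
independent = df.copy()   # new object with its own data

independent.loc[0, "a"] = 99   # does not touch df
alias.loc[1, "a"] = -1         # changes df as well

print(df["a"].tolist())
print(independent["a"].tolist())
```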

*You just have to assign new values to a selection*

Let's assign the values 1, 2 and 3 to the first three rows of the "Unique Key" column


In [None]:
complaints_copy.loc[0:2,'Unique Key'] = np.array([1,2,3])
complaints_copy.head()

### Adding new data

Let's add a Daily Total column to our data, where each value is the sum of the counter columns for that day

In [None]:
# Adding a column
bikes_copy['Daily Total'] = (
    bikes['Berri 1'] + bikes['Côte-Sainte-Catherine']
    + bikes['Maisonneuve 1'] + bikes['Maisonneuve 2']
    + bikes['du Parc'] + bikes['Pierre-Dupuy'] + bikes['Rachel1']
)
bikes_copy.head()

### Dropping data
- To delete a row or a column, use the function `drop()`.
- To drop a column, set `axis = 1`.
- By default `drop()` returns a new DataFrame, so you would have to reassign: `bikes_copy = bikes_copy.drop(...)`. To delete the column without reassigning, set the parameter `inplace = True`.


In [None]:
bikes_copy.drop('Brébeuf (données non disponibles)', axis = 1, inplace = True)
bikes_copy.head()

Finally, to drop by column position instead of by column label, index `columns` with the positions. Since positions are zero-based, `[2, 3]` deletes the third and fourth columns:

In [None]:
bikes_copy.drop(bikes_copy.columns[[2, 3]], axis=1,inplace = True)  # df.columns is zero-based pd.Index 
bikes_copy.head()
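
The variants of `drop()` can be compared on a small made-up table (columns `a` to `d` are placeholders). This sketch uses the non-inplace form, so the original table is left unchanged:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6], "d": [7, 8]})

without_b = df.drop("b", axis=1)                    # by column label
without_c_d = df.drop(df.columns[[2, 3]], axis=1)   # by zero-based position
without_first_row = df.drop(0, axis=0)              # by row label

print(list(without_b.columns))
print(list(without_c_d.columns))
print(list(df.columns))  # df itself still has all four columns
```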

### Sorting values

To sort the rows by the values of a certain column, use `sort_values(by=..., ascending=...)`:

In [None]:
bikes_copy = bikes_copy.sort_values(by = 'Daily Total',ascending=False)
bikes_copy.head()

After sorting, the row index labels are reordered along with the rows. To renumber the rows from 0, we use the `reset_index()` function; `drop = True` means that the old index is not kept as a column.

In [None]:
bikes_copy.reset_index(inplace=True, drop = True)
bikes_copy.head()
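
A minimal sketch of the effect on the index, using three invented totals:

```python
import pandas as pd

df = pd.DataFrame({"Daily Total": [300, 900, 600]})

sorted_df = df.sort_values(by="Daily Total", ascending=False)
renumbered = sorted_df.reset_index(drop=True)

print(list(sorted_df.index))    # old labels travel with their rows
print(list(renumbered.index))   # labels renumbered from 0
```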


### Applying a Function to Each Column

Sometimes it is necessary to apply the same function to certain columns of the data. To do this we use `apply()`.
Let's, for example, divide each value from the third column onward by the value of the Daily Total column in the same row, and store the result in a DataFrame named proportions. With `axis=1`, the function receives one row at a time. Note that `apply()` has no `inplace` option: it returns a new object,
so we assign the result to proportions.

In [None]:
proportions = bikes_copy.iloc[:,2:].apply(lambda x: x/x['Daily Total'], axis=1)
proportions.head()
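
For a simple division like this one, a vectorized alternative is `DataFrame.div` with `axis=0`, which divides every column by the same Series row by row. The sketch below, on invented counts, checks that both approaches agree:

```python
import pandas as pd

df = pd.DataFrame({
    "Berri 1": [30, 10],
    "du Parc": [70, 30],
    "Daily Total": [100, 40],
})

# Row by row, as in the cell above.
with_apply = df.apply(lambda row: row / row["Daily Total"], axis=1)

# Vectorized: align the divisor on the row index.
with_div = df.div(df["Daily Total"], axis=0)

print(with_div)
```

On large tables the vectorized form is usually much faster, because it avoids calling a Python function once per row.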

### Dealing with NA

Missing values are a common problem in data analysis. When data is loaded, missing values are usually replaced automatically by NaN. To clean our data we can either delete the rows containing NaN or replace the NaNs with specific values.

Let's check whether we have NaN values in the bikes data. For that we use the `isnull()` function. The result is True if the value is NaN and False if it is not.

In [None]:
# To get the boolean mask where values are NaN 
bikes_copy.isnull().head()

Use the `fillna()` function to replace the NaN values with a value of your choice.
The value can be a scalar such as a number or a string, or a dict, Series or DataFrame that specifies a different fill value for each column.

In [None]:
# Filling missing data
bikes_copy.fillna(value = 0, inplace=False).head()

In [None]:
bikes_copy.head()

Let's replace the NaN values in the Closed Date column in complaints by the string "Unknown"

In [None]:
complaints.head(3)

In [None]:
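# assign the result back: fillna(inplace=True) on a selected column may not update complaints_copy
complaints_copy['Closed Date'] = complaints_copy['Closed Date'].fillna(value='Unknown')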
complaints_copy.head(3)

To delete rows containing NaN, use the `dropna()` function:
- `axis`: the axis along which missing values are dropped, 0 <=> rows and 1 <=> columns
- `how = 'any'`: drop the row (or column) if any of its values is NaN
- `how = 'all'`: drop the row (or column) only if all of its values are NaN, as in the cell below

In [None]:
complaints.dropna(axis = 0, how = 'all', inplace = False).head()
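
The difference between `'any'` and `'all'`, and the per-column form of `fillna()`, can be seen on a three-row sketch with made-up columns `x` and `y`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1.0, np.nan, np.nan],
    "y": ["a", "b", None],
})

drop_any = df.dropna(axis=0, how="any")   # keeps only fully filled rows
drop_all = df.dropna(axis=0, how="all")   # removes only fully empty rows
filled = df.fillna({"x": 0, "y": "Unknown"})  # one fill value per column

print(len(drop_any), len(drop_all))
print(filled)
```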

## Counting Values And Filtering<a name="_counting values and filtering"></a>

To count the occurrences of each value in a specific column, we use the `.value_counts()` function.
What's the most common complaint type? This is a really easy question to answer!

In [None]:
complaints['Complaint Type'].value_counts()

If we just wanted the top 10 most common complaints, we can do this:

In [None]:
complaint_counts = complaints['Complaint Type'].value_counts()
complaint_counts[:10]

Transform the result to a dictionary.

In [None]:
dict(complaint_counts)

But it gets better! We can plot them!

In [None]:
complaint_counts[:10].plot(kind='bar')

We can also see how many unique values there are in a column, for example the agencies, using `.unique()` and `len` (for length):

In [None]:
complaints['Agency'].unique()

In [None]:
# get the number of unique values for a feature
len(complaints['Agency'].unique())
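
As a shortcut, `nunique()` returns the number of distinct values directly. A minimal sketch on an invented Agency column, which also shows that `value_counts()` sorts from most to least frequent:

```python
import pandas as pd

agency = pd.Series(["NYPD", "DOT", "NYPD", "HPD", "NYPD", "DOT"], name="Agency")

counts = agency.value_counts()   # most frequent value first
top_two = counts.head(2)
n_unique = agency.nunique()      # same as len(agency.unique())

print(top_two)
print(n_unique)
```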

## Application : Which Borough has the Most Noise Complaints?<a name="_ 8. application : which borough has the most noise complaints?"></a>

To get the noise complaints, we first need to find the rows where the "Complaint Type" column is "Noise - Street/Sidewalk". We use boolean indexing (see the Boolean Indexing section above).

In [None]:
noise_complaints = complaints[complaints['Complaint Type'] == "Noise - Street/Sidewalk"]
noise_complaints[0:10]
noise_complaints['Borough']

In [None]:
noise_complaints['Borough'].value_counts().plot(kind = 'bar')

It's Manhattan! But what if we wanted to divide by the total number of complaints in each borough, to make the comparison more meaningful? That would be easy too:

In [None]:
complaints['Borough'].value_counts()

In [None]:
results = noise_complaints['Borough'].value_counts() / complaints['Borough'].value_counts()
results
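
This division works because pandas aligns the two Series on their index labels (the borough names), not on their positions. A sketch with invented counts, where one borough has no noise complaints at all:

```python
import pandas as pd

# Hypothetical counts; the two Series list the boroughs in different orders.
noise = pd.Series({"MANHATTAN": 30, "BROOKLYN": 10})
total = pd.Series({"BROOKLYN": 100, "MANHATTAN": 200, "QUEENS": 50})

share = noise / total   # matched by label; unmatched labels give NaN
print(share)
```

A label that is present in only one of the two Series yields NaN in the result, so a borough missing from the numerator shows up as NaN rather than 0.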

What if we want to graph this?
The default is a line graph, which is not the right type for this kind of data.  This data is not timeseries (where the X axis is dates/times).  This data is count data by categories that are not ordered -- "boroughs."  The proper type of chart for this is a bar graph.

In [None]:
results.plot()

In [None]:
# we can use the keyword argument kind for this:
results.sort_values(ascending=False).plot(kind = 'bar')

# In-Class Exercises
## Part 1: Filtering and Counting Things


In [None]:
df = pd.read_csv('data/chicago_crimes.csv', parse_dates=['Date-Time'], dayfirst=False)

In [None]:
df.describe()

The problem above is that some of those columns hold numbers that are not measures you can do math on. The Beat, District, Ward, Community Area, and id are codes, so taking their mean, max, etc. is meaningless. We should change them to string types, or "object" in pandas type notation.

In [None]:
df = df.astype({"Ward":"object","Beat":"object", "District":"object", "identification":"object", "Community Area": "object"}, copy=True)

In [None]:
df.describe()

In [None]:
df.head()

### What are the columns in this data set?


In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### What are all the Primary Types?  
Hint: use unique()


In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Find out how many of each type occur.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Make a bar chart of the top 10

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Make a new dataframe of just the theft types.


In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Get the counts of each Description type inside the Thefts dataframe.


In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Using your answer above, make a bar chart of the Description type counts.


In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### What percentage of all crimes in this data occur in Ward 42? 
Hint: just find out how many occur in ward 42 and then divide by all crimes.


In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### What is the most common crime Primary Type that results in an arrest?


In [None]:
# YOUR CODE HERE
raise NotImplementedError()

## Part 2: Statistics Computation
### Load in the Paris rainfall data

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
rain.describe()

In [None]:
rain.tail()

### Use .loc to get all the rain values for the year 1700


In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Sort by rain amount in January, ascending=False. Now get the top row. What's the year?


In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Find the values for the month of June. Plot in a bar chart. (It's ok if you can't read the x axis labels.)


In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### What are the max and min values for Jun?  What is the median?


In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Make a histogram for one month's rain.


In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Create a column for the total rain each year.  
It's okay if you have a NaN for years with columns with missing data, too.


In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Plot the proportion or percentage of rain for each January out of the total (hint: jan/total), using a line chart.


In [None]:
# YOUR CODE HERE
raise NotImplementedError()