# Background on Python, IPython, Jupyter and Pandas

[Python](https://www.python.org/) is a high-level, general-purpose programming language named after a [British comedy troupe](https://www.youtube.com/user/MontyPython), created by a [Dutch programmer as a hobby project](http://en.wikipedia.org/wiki/Guido_van_Rossum) and maintained by an international group of friendly but opinionated Python enthusiasts (try `import this`). Until July 2018, Guido van Rossum was the [Benevolent dictator for life](https://en.wikipedia.org/wiki/Benevolent_dictator_for_life) for the Python language; decisions are now made jointly by the Python Steering Council.

Python is popular for data science because it's powerful, fast, plays well with others, runs everywhere, is easy to learn, highly readable, and open. Because it's general purpose, it can be used for full-stack development. It has a growing list of useful libraries for scientific programming, data manipulation, and data analysis (NumPy, SciPy, Pandas, Scikit-Learn, Statsmodels, Matplotlib, PyBrain, etc.).

[IPython](http://ipython.org/) is an enhanced, interactive Python interpreter started as a grad school project by [Fernando Perez](http://fperez.org/). IPython (now Jupyter) notebooks let you run code in several languages (Python, R, Julia, etc.) alongside Markdown and LaTeX in your browser to create rich, portable, and shareable code documents.

[Pandas](http://pandas.pydata.org/) is a library created by [Wes McKinney](http://blog.wesmckinney.com/) that introduces the R-like dataframe object to Python and makes working with data in Python a lot easier. It's also a lot more efficient than the R dataframe and pretty much makes Python superior to R in every imaginable way (except for ggplot2).

## Getting started with Jupyter (IPython) Notebooks

To start up a Jupyter notebook server, simply navigate to the directory where you want the notebooks to be saved and run the command

```
jupyter notebook
```

A browser should open with a notebook navigator. Click the "New" button and select "Python 3".

A beautiful blank notebook should open in a new tab.

Name the notebook by clicking on "Untitled" at the top of the page.

Notebooks are sequences of cells. Cells can be markdown, code, or raw text. Change the first cell to markdown and briefly describe what you are going to do in the notebook.

## Getting started with Pandas

We start by importing the libraries we're going to use: `pandas` and `numpy`. The `%matplotlib inline` magic makes plots render inside the notebook.

In [None]:
# Import Statements
import pandas as pd
import numpy as np
%matplotlib inline

In [None]:
crimes = pd.read_csv('chicago_past_year_crimes.csv')

## Loading data into a Pandas DataFrame

So far we've been working with raw text files. That's one way to store and interact with data, but only a limited set of functions can take raw text as input. Python has an amazing array of data structures that give you a lot of extra power in working with data.

Built-in Data Structures
- strings ""
- lists []
- tuples ()
- sets {}
- dictionaries {'key':value}

Additional Essential Data Structures

- numpy arrays ([])
- pandas Series
- pandas DataFrame
- tensorflow Tensors


Today we'll primarily be working with the pandas DataFrame. The pandas DataFrame is a two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes. It's basically a spreadsheet you can program and it's an incredibly useful Python object for data analysis. 
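To make the "spreadsheet you can program" idea concrete, here is a minimal sketch that builds a tiny DataFrame by hand. The column names and values are made up for illustration and are not taken from the crimes file.

```python
import pandas as pd

# Hypothetical toy data: each column holds a different type (heterogeneous)
toy = pd.DataFrame(
    {
        "case_id": ["A1", "A2", "A3"],   # strings
        "beat": [111, 222, 333],         # integers
        "arrest": [True, False, True],   # booleans
    }
)

print(toy.shape)          # (number of rows, number of columns)
print(toy.dtypes)         # one dtype per column
print(list(toy.columns))  # labels on the column axis
print(list(toy.index))    # labels on the row axis

# Size-mutable: adding a column changes the shape
toy["ward"] = [1, 2, 3]
print(toy.shape)
```

Both axes carry labels: the columns are named, and the rows get a default integer index unless you supply your own.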

You can load data into a dataframe using Pandas' excellent `read_*` functions.

We're going to try two of them: read_table & read_csv
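As a rough sketch of how the two relate: `read_csv` expects commas by default and `read_table` expects tabs. The example below uses `io.StringIO` to stand in for files on disk, and the two-row data is invented for illustration.

```python
import io

import pandas as pd

# Invented sample text standing in for real files
csv_text = "CASE#,BEAT\nA1,111\nA2,222\n"
tsv_text = "CASE#\tBEAT\nA1\t111\nA2\t222\n"

# read_csv splits on commas by default
from_csv = pd.read_csv(io.StringIO(csv_text))

# read_table splits on tabs by default
from_tsv = pd.read_table(io.StringIO(tsv_text))

# read_csv can also read tab-separated data if you tell it the separator
from_tsv_via_csv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(from_csv)
print(from_tsv.equals(from_tsv_via_csv))
```

With a real file you would pass the file path instead of the `StringIO` object.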

Pro tip: TAB COMPLETION!

Pro tip: Jupyter will pull up the docstring for a command if you follow its name with a question mark, e.g. `pd.read_csv?`.

Pro tip: jupyter will give you the allowable arguments if you hit `shift + tab`

## Viewing data in pandas

There are lots of options for viewing data in pandas. Just like we did in the command line, you can use `head` and `tail` to get a quick view of our data.

In [None]:
# Look at the head of the dataframe
crimes.head()

In [None]:
#Now try the same thing, but for the end of the dataframe, called the "tail". 

crimes.tail()

What's the shape of the data?

In [None]:
crimes.shape

What does this ^^ mean?

Now, let's check out the datatypes using [.dtypes](http://pandas.pydata.org/pandas-docs/version/0.21/generated/pandas.DataFrame.dtypes.html).

In [None]:
# Enter code to look at datatypes
crimes.dtypes

Pro tip: you'll notice that some commands have looked like `pd.something()`, some like `data.something()`, and some like `data.something` without `()`. The difference is a pandas function or class vs. a method vs. an attribute. Methods are actions you take on a dataframe or series, while attributes are descriptors of the dataframe or series.
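A small sketch of the three forms side by side, using a made-up one-column DataFrame rather than the crimes data:

```python
import pandas as pd

# pd.DataFrame(...) is a pandas class; calling it builds a new object
toy = pd.DataFrame({"beat": [111, 222, 333]})

# Method: an action on the dataframe, called with parentheses
first_rows = toy.head(2)

# Attribute: a descriptor of the dataframe, accessed without parentheses
dims = toy.shape

# pandas function: lives on the pd module and takes dataframes as input
combined = pd.concat([toy, toy])

print(first_rows)
print(dims)
print(combined.shape)
```

If you forget the parentheses on a method (`toy.head`), you get the method object itself back instead of its result.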


## Modifying your dataframe

Notice that we have some issues when it comes to our column names. Let's fix that by learning how to edit and delete columns.

In [None]:
crimes.columns

Notice that some of the column names have spaces at the start or end of the name. Let's remove those.

In [None]:
# remove white spaces
crimes.columns = crimes.columns.str.strip()
crimes.columns

In [None]:
# replacing spaces with underscore
crimes.columns = crimes.columns.str.replace(' ', '_')
crimes.columns

In [None]:
# We should also remove the double occurrence of "_" in DATE__OF_OCCURRENCE. Do so below:

# New Code here

crimes.columns = crimes.columns.str.replace('__', '_')

# crimes.columns

The `LOCATION` Column seems redundant, seeing that we also have `X_COORDINATE` and `Y_COORDINATE` columns. 

Let's drop it using [.drop](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html), axis=1 (for columns), and inplace=True.

In [None]:
# write code to drop column below:

crimes.drop('LOCATION', axis=1, inplace=True)

In [None]:
crimes.columns

Task: Rename columns 'CASE#' to 'CASE_ID'.

Hint: Google "pandas rename column"
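One possible approach, sketched on a made-up two-column DataFrame rather than on `crimes` itself, is `DataFrame.rename` with a `columns` mapping from old name to new name:

```python
import pandas as pd

# Toy frame with invented values, only to show the renaming pattern
toy = pd.DataFrame({"CASE#": ["A1", "A2"], "BEAT": [111, 222]})

# rename returns a new DataFrame; columns not in the mapping are left alone
toy = toy.rename(columns={"CASE#": "CASE_ID"})

print(toy.columns.tolist())
```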

## Describing the entire dataframe

Now that we have clean column names, we want to get a better global view of our data. There are several ways to do that.

In [None]:
crimes.describe()

In [None]:
crimes.describe(include=['O'])

In [None]:
crimes.isnull().sum()
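To see what each of these three calls returns, here is a minimal sketch on an invented DataFrame with one numeric column, one text column, and a missing value in each. The text column is forced to `object` dtype so that `include=['O']` picks it up.

```python
import numpy as np
import pandas as pd

# Invented data with one missing value per column
toy = pd.DataFrame(
    {
        "beat": [111, 222, np.nan],
        "kind": pd.Series(["THEFT", "THEFT", None], dtype=object),
    }
)

# Numeric columns only: count, mean, std, min, quartiles, max
numeric_summary = toy.describe()

# Object (text) columns only: count, unique, top, freq
text_summary = toy.describe(include=["O"])

# Number of missing values in each column
missing_per_column = toy.isnull().sum()

print(numeric_summary)
print(text_summary)
print(missing_per_column)
```

Note that `count` in both summaries ignores missing values, which is why it can be smaller than the number of rows.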

## Selecting and subsetting in pandas

One of the biggest benefits of having a labeled, indexed object like a DataFrame is the ability to easily select rows, columns, and subsets of the data. Let's learn how to do that.

First we will select individual series from the dataframe.

In [None]:
crimes['PRIMARY_DESCRIPTION'].head()

In [None]:
#using . notation
crimes.PRIMARY_DESCRIPTION.head()

In [None]:
# get value counts
crimes.PRIMARY_DESCRIPTION.value_counts()

In [None]:
# selecting two columns

# enter code to select two columns. It will look something like: dataframe[[column 1, column 2]]
crimes[['PRIMARY_DESCRIPTION', 'SECONDARY_DESCRIPTION']].head()

In [None]:
#subset by row index
crimes.PRIMARY_DESCRIPTION[3:10]

In [None]:
crimes.PRIMARY_DESCRIPTION[5:16]

In [None]:
#Use the iloc method 
crimes.iloc[10:20,4:6]

### Now let's subset on row values.

In [None]:
#Create a boolean series based on a condition
theft_bool = crimes['PRIMARY_DESCRIPTION']=='THEFT'
theft_bool

In [None]:
#now pass that series to the dataframe to subset it
theft = crimes[theft_bool]
theft.head()

In [None]:
#same subset in one step; .copy() makes theft independent of crimes,
#so later in-place changes to theft don't raise SettingWithCopyWarning
theft = crimes[crimes['PRIMARY_DESCRIPTION']=='THEFT'].copy()
theft.head()

In [None]:
## Now, create a dataframe that only has the subset of crimes with the primary description 'CRIMINAL DAMAGE'

# code here:

damage = crimes[crimes['PRIMARY_DESCRIPTION']=='CRIMINAL DAMAGE']
damage.head()

In [None]:
## Now try creating a dataframe that has the subset of crimes with the primary description CRIMINAL DAMAGE 
# and the secondary description TO PROPERTY

# Will need to google how to use & operator with pandas

crimes[(crimes['PRIMARY_DESCRIPTION']=='CRIMINAL DAMAGE')&(crimes['SECONDARY_DESCRIPTION']=='TO PROPERTY')].head()
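The pattern above can be broken into named steps. The sketch below uses a made-up three-row DataFrame with the same column names. Each comparison needs its own parentheses when written inline, because `&` binds more tightly than `==`.

```python
import pandas as pd

# Invented rows that reuse the column names from the crimes data
toy = pd.DataFrame(
    {
        "PRIMARY_DESCRIPTION": ["THEFT", "CRIMINAL DAMAGE", "CRIMINAL DAMAGE"],
        "SECONDARY_DESCRIPTION": ["RETAIL THEFT", "TO PROPERTY", "TO VEHICLE"],
    }
)

# Each condition is a boolean Series with one True/False per row
is_damage = toy["PRIMARY_DESCRIPTION"] == "CRIMINAL DAMAGE"
is_property = toy["SECONDARY_DESCRIPTION"] == "TO PROPERTY"

both = toy[is_damage & is_property]    # & means "and", row by row
either = toy[is_damage | is_property]  # | means "or", row by row
not_damage = toy[~is_damage]           # ~ means "not", row by row

print(both)
print(either)
print(not_damage)
```

Python's own `and`, `or`, and `not` keywords do not work element-wise on a Series, which is why pandas uses `&`, `|`, and `~` instead.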


### Sorting

In [None]:
theft.sort_values('DATE_OF_OCCURRENCE', inplace=True, ascending=False)
theft.head()

In [None]:
# For a quick check, let's also look at this where it is ascending

# code here:
theft.sort_values('DATE_OF_OCCURRENCE', inplace=True, ascending=True)


In [None]:
theft.head()

Hmmm. Something isn't right about how this is sorting. Let's look into it.

In [None]:
theft.dtypes

Right now the dates are stored as strings (dtype `object`), so they sort alphabetically rather than chronologically. To ensure they're handled correctly, they should be datetime. Let's fix that!
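Here is a small sketch of the problem on three invented date strings. The `MM/DD/YYYY hh:mm:ss AM/PM` layout is an assumption for illustration; check the actual format in your file.

```python
import pandas as pd

# Invented date strings; the format is assumed, not read from the crimes file
raw = pd.Series(
    [
        "12/01/2017 10:00:00 AM",
        "02/15/2018 09:30:00 AM",
        "11/20/2017 08:00:00 PM",
    ]
)

# Sorting strings compares them character by character, so "02/..." comes first
sorted_as_text = raw.sort_values()

# After parsing, sorting is chronological
parsed = pd.to_datetime(raw, format="%m/%d/%Y %I:%M:%S %p")
sorted_as_dates = parsed.sort_values()

print(sorted_as_text)
print(sorted_as_dates)
```

In the text sort, the February 2018 entry lands before both 2017 entries because its string starts with "02".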

In [None]:
theft.DATE_OF_OCCURRENCE = pd.to_datetime(theft.DATE_OF_OCCURRENCE)

In [None]:
theft.head()

In [None]:
theft.sort_values('DATE_OF_OCCURRENCE', inplace=True, ascending=True)
theft.head()

### loc vs iloc
You can see that the row labels for the first five rows are NOT 0, 1, 2, 3, and 4. If we want to select the first five rows, we can use the `DataFrame.iloc[]` indexer to select by position. If you want to select the rows with labels 0 through 4, you would use `DataFrame.loc[]`.

The easy way to remember which is which is to remember that `iloc[]` stands for integer location, because you use integers and not labels to select the data.
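A minimal sketch of the difference, using a made-up DataFrame whose row labels are deliberately out of order (as they would be after sorting):

```python
import pandas as pd

# Invented data with shuffled row labels
toy = pd.DataFrame({"value": ["a", "b", "c", "d"]}, index=[12, 3, 0, 7])

# iloc selects by position: the first two rows, whatever their labels are
first_two_by_position = toy.iloc[:2]

# iloc[0] is the row in the first position (label 12 here)
row_at_position_0 = toy.iloc[0]

# loc selects by label: the row whose label is 0 (third position here)
row_with_label_0 = toy.loc[0]

# loc also accepts a list of labels
rows_12_and_0 = toy.loc[[12, 0]]

print(first_two_by_position)
print(row_at_position_0["value"], row_with_label_0["value"])
print(rows_12_and_0)
```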

In [None]:
#print the first five rows of theft data


In [None]:
#print first ten rows of theft data


In [None]:
#print the rows with index label 12


In [None]:
# print the row at the fifth position


## Applying functions to series and creating new columns

In [None]:
scores = pd.read_csv('fandango_score_comparison.csv')

In [None]:
scores.head()

In [None]:
scores.describe()

In [None]:
scores.info()

In [None]:
scores.IMDB.mean()

In [None]:
scores.IMDB.describe()

In [None]:
# find the max of IMDB column

max_IMDB = scores.IMDB.max()

In [None]:
# find the min of the IMDB column

min_IMDB = scores.IMDB.min()

In [None]:
#Return the list of movies with the lowest score:

scores[scores.IMDB == scores.IMDB.min()]

In [None]:
# Return the list of movies with the highest score:




In [None]:
# Movies with the highest RottenTomatoes rating



In [None]:
# Movies with the lowest RottenTomatoes rating



Now we can plot the series with ease!

## Groupby!
Groupby is a powerful method that makes it easy to perform operations on the dataframe by categorical values. Let's try counting crimes by category and summarizing them by beat.

In [None]:
crimes.groupby('PRIMARY_DESCRIPTION').size()

In [None]:
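# numeric_only=True skips text columns, which newer pandas versions refuse to average
crimes.groupby('BEAT').mean(numeric_only=True)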

In [None]:
# How would we do the same, but groupby both primary and secondary description?

crimes.groupby(['PRIMARY_DESCRIPTION', 'SECONDARY_DESCRIPTION']).size()
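To see what `groupby` produces, here is a sketch on a three-row invented DataFrame. `WARD` is a hypothetical numeric column used only to give `mean` something to average.

```python
import pandas as pd

# Invented rows; WARD is a hypothetical numeric column for illustration
toy = pd.DataFrame(
    {
        "PRIMARY_DESCRIPTION": ["THEFT", "THEFT", "BATTERY"],
        "BEAT": [111, 111, 222],
        "WARD": [1, 3, 5],
    }
)

# size() counts the rows in each group and returns a Series
counts = toy.groupby("PRIMARY_DESCRIPTION").size()

# mean() averages every numeric column within each group
mean_by_beat = toy.groupby("BEAT").mean(numeric_only=True)

print(counts)
print(mean_by_beat)
```

The grouping column becomes the index of the result, so `counts["THEFT"]` looks up the number of theft rows.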

In [None]:
crimes.head()

### Let's group by year and month to look at data on a month to month basis.

In [None]:
crimes.DATE_OF_OCCURRENCE = pd.to_datetime(crimes.DATE_OF_OCCURRENCE)

In [None]:
# Create a new year column
crimes['year'] = crimes.DATE_OF_OCCURRENCE.map(lambda x: x.year)

In [None]:
#Create a new month column

# code here
crimes['month'] = crimes.DATE_OF_OCCURRENCE.map(lambda x: x.month)

In [None]:
crimes.head()

In [None]:
# Groupby the new year and month columns

month_year_crimes = crimes.groupby(['year', 'month']).size()

In [None]:
type(month_year_crimes)
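Grouping by two columns and calling `size()` gives back a Series whose index has two levels, one per grouping column. The sketch below shows this on four invented dates. It uses the `.dt` accessor as an alternative to `map` with a lambda.

```python
import pandas as pd

# Four invented dates spread over three months
dates = pd.to_datetime(["2017-11-20", "2017-11-25", "2017-12-01", "2018-01-15"])
toy = pd.DataFrame({"DATE_OF_OCCURRENCE": dates})

# .dt exposes datetime parts for a whole column at once
toy["year"] = toy.DATE_OF_OCCURRENCE.dt.year
toy["month"] = toy.DATE_OF_OCCURRENCE.dt.month

per_month = toy.groupby(["year", "month"]).size()

print(type(per_month).__name__)        # a Series, not a DataFrame
print(type(per_month.index).__name__)  # its index has two levels
print(per_month.loc[(2017, 11)])       # look up one (year, month) pair

# reset_index turns the two index levels back into ordinary columns
flat = per_month.reset_index(name="n_crimes")
print(flat)
```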

Plots!!

In [None]:
month_year_crimes.plot()

In [None]:
month_year_crimes.hist()