# Jupyter intro

Editing in Jupyter

When you press Shift+Enter, the current cell is executed: Python code is run, and Markdown is rendered nicely.

# Syllabus

1. Intro to analytics in Python
    - What is Pandas? How does it relate to libraries like NumPy?
    - Basics of reading data from a CSV file
    - Basics of analysis and some methods
    - Simple visualizations
2. Pandas series (1D data structures)
   - Creating a series
   - Analyzing a series
   - Broadcasting
   - What *not* to do if you're an experienced Python programmer
3. Mask/boolean arrays
   - Retrieving from a series with booleans
   - Using a mask array to select which items we want
4. Index
   - Setting the index
   - Retrieving using the index
   - Multi-indexing (a little bit)
5. Dtype
   - What are dtypes?
   - Choosing a dtype
   - Changing dtypes
   - 'NaN' ("not a number") and working with it
6. Reading data from a file
   - Turning a file into a 2D "data frame" (rows and columns)
   - CSV files
   - Retrieving rows
   - Retrieving columns
7. Different data types
   - Excel
   - JSON
   - Retrieving resources from the Internet
   - Scraping Web sites
8. Sorting
9. Grouping and Pivot Tables
10. Cleaning your data
11. Working with text
12. Dates and times
13. Visualization
    - Charts
    - Plots
14. New trends, and where to go from here

# What is Pandas?

To understand Pandas, it helps to start with NumPy.

NumPy is a Python module that is largely written in C. It exposes C data structures to us via a layer of Python, so we get the ease of Python together with the speed and efficiency of C.

NumPy is very low level. You get the data structure, and then you have to do the work yourself. Pandas provides us with lots of convenient methods for working with NumPy data at a higher level. Pandas knows how to perform many more calculations, and it offers many more advanced functions for strings, dates, and plotting. It also makes retrieving and setting data easier.
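To make the relationship concrete, here is a minimal sketch that wraps a NumPy array in a Pandas series. The values and the labels 'a' through 'e' are made up for illustration; only the idea (NumPy underneath, Pandas conveniences on top) comes from the text above.

```python
import numpy as np
import pandas as pd

# A plain NumPy array: fast, but it only knows about positions 0, 1, 2, ...
distances = np.array([1.63, 0.46, 0.87, 2.13, 1.40])
numpy_mean = distances.mean()

# Wrapping the same values in a Pandas series adds labels and extra methods.
s = pd.Series(distances, index=['a', 'b', 'c', 'd', 'e'], name='trip_distance')
pandas_mean = s.mean()     # same calculation, same answer
by_label = s['d']          # retrieve by label rather than by position
summary = s.describe()     # count, mean, std, min, quartiles, max in one call

# The NumPy data is still available underneath.
underlying = s.to_numpy()
```

The series gives the same numeric results as the array, but with labeled retrieval and summary methods that we would otherwise have to write ourselves.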

Pandas allows us to:
- Read data from a wide variety of formats and sources
- Clean the data
- Analyze the data in numerous ways
- Write our analysis out to different formats and outputs
- Create visualizations of our work in charts and graphs
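The first four steps in that list can be sketched in a few lines. This is only an illustration: the CSV text, the column names, and the choice of JSON as the output format are all assumptions, and plotting is left out because it needs an additional library (Matplotlib).

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a real file on disk.
raw = io.StringIO(
    "city,temp_c\n"
    "Oslo,4.0\n"
    "Cairo,\n"
    "Lima,19.5\n"
)

df = pd.read_csv(raw)                       # read
clean = df.dropna(subset=['temp_c'])        # clean: drop rows with a missing temperature
mean_temp = clean['temp_c'].mean()          # analyze
as_json = clean.to_json(orient='records')   # write out in a different format
```

Here 'to_json' returns a string because no file path was given; passing a path would write to disk instead.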

# Using Pandas

If you want to work with Pandas, you'll have to load it as a Python module using 'import':

    import pandas as pd

Just about everyone uses that 'pd' alias!

In [2]:
import pandas as pd

# Loading data

If I want to load data from a file into Pandas, what I'm really saying is: There is a file on disk that contains information in a format that Pandas knows -- most often, CSV -- and I want Pandas to turn it into a data frame.

EXERCISE: LOAD THE TAXI DATA

In [11]:
filename = '../data/taxi.csv'   #the file I want to read from is in the parallel "data" directory
taxi_df = pd.read_csv(filename)     #give me a data frame based on this data

In [13]:
type(taxi_df) # what kind of value does taxi_df refer to?

pandas.core.frame.DataFrame

In [14]:
#how big are you, in terms of rows and columns?
taxi_df.shape

(9999, 19)

In [15]:
taxi_df.head() #show me the first five rows of this data frame, taxi_df

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
0,2,2015-06-02 11:19:29,2015-06-02 11:47:52,1,1.63,-73.95443,40.764141,1,N,-73.974754,40.754093,2,17.0,0.0,0.5,0.0,0.0,0.3,17.8
1,2,2015-06-02 11:19:30,2015-06-02 11:27:56,1,0.46,-73.971443,40.758942,1,N,-73.978539,40.761909,1,6.5,0.0,0.5,1.0,0.0,0.3,8.3
2,2,2015-06-02 11:19:31,2015-06-02 11:30:30,1,0.87,-73.978111,40.738434,1,N,-73.990273,40.745438,1,8.0,0.0,0.5,2.2,0.0,0.3,11.0
3,2,2015-06-02 11:19:31,2015-06-02 11:39:02,1,2.13,-73.945892,40.773529,1,N,-73.971527,40.76033,1,13.5,0.0,0.5,2.86,0.0,0.3,17.16
4,1,2015-06-02 11:19:32,2015-06-02 11:32:49,1,1.4,-73.979088,40.776772,1,N,-73.982162,40.758999,2,9.5,0.0,0.5,0.0,0.0,0.3,10.3


In [16]:
taxi_df.tail()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
9994,1,2015-06-01 00:12:59,2015-06-01 00:24:18,1,2.7,-73.947792,40.814972,1,N,-73.973358,40.783638,2,11.0,0.5,0.5,0.0,0.0,0.3,12.3
9995,1,2015-06-01 00:12:59,2015-06-01 00:28:16,1,4.5,-74.004066,40.747818,1,N,-73.953758,40.779285,1,16.0,0.5,0.5,3.0,0.0,0.3,20.3
9996,2,2015-06-01 00:13:00,2015-06-01 00:37:25,1,5.59,-73.994377,40.766102,1,N,-73.903206,40.750546,2,21.0,0.5,0.5,0.0,0.0,0.3,22.3
9997,2,2015-06-01 00:13:02,2015-06-01 00:19:10,6,1.54,-73.978302,40.748531,1,N,-73.989166,40.762852,2,6.5,0.5,0.5,0.0,0.0,0.3,7.8
9998,1,2015-06-01 00:13:04,2015-06-01 00:36:33,1,5.8,-73.983215,40.726414,1,N,-73.924133,40.701645,1,21.0,0.5,0.5,4.45,0.0,0.3,26.75


In [17]:
taxi_df.head(3)

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
0,2,2015-06-02 11:19:29,2015-06-02 11:47:52,1,1.63,-73.95443,40.764141,1,N,-73.974754,40.754093,2,17.0,0.0,0.5,0.0,0.0,0.3,17.8
1,2,2015-06-02 11:19:30,2015-06-02 11:27:56,1,0.46,-73.971443,40.758942,1,N,-73.978539,40.761909,1,6.5,0.0,0.5,1.0,0.0,0.3,8.3
2,2,2015-06-02 11:19:31,2015-06-02 11:30:30,1,0.87,-73.978111,40.738434,1,N,-73.990273,40.745438,1,8.0,0.0,0.5,2.2,0.0,0.3,11.0
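As the cells above suggest, 'head' and 'tail' default to five rows and accept an optional row count. A small sketch on a made-up data frame (the values are illustrative, not the full taxi data):

```python
import pandas as pd

# A tiny stand-in data frame with seven rows.
df = pd.DataFrame({'trip_distance': [1.63, 0.46, 0.87, 2.13, 1.40, 2.70, 4.50]})

first_five = df.head()     # no argument: first 5 rows
first_three = df.head(3)   # first 3 rows
last_two = df.tail(2)      # last 2 rows, keeping their original index values
```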


In [21]:
#let's get data from one of the columns
#we can do that by using [] and putting the column name inside of them
#this feels a lot like retrieving from a dict

#2D data -- data frame
#1D data, including each column, is a series

taxi_df['trip_distance']

0       1.63
1       0.46
2       0.87
3       2.13
4       1.40
        ... 
9994    2.70
9995    4.50
9996    5.59
9997    1.54
9998    5.80
Name: trip_distance, Length: 9999, dtype: float64
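The difference between 1D and 2D results can be checked directly. The sketch below uses a small made-up data frame; passing a list of column names inside the square brackets is a standard Pandas feature that the cell above does not use, shown here only for contrast.

```python
import pandas as pd

# A small stand-in for the taxi data.
df = pd.DataFrame({
    'trip_distance': [1.63, 0.46, 0.87],
    'total_amount': [17.8, 8.3, 11.0],
})

# A single column name inside [] gives a series (1D).
one_column = df['trip_distance']

# A list of column names inside [] gives a data frame (2D).
two_columns = df[['trip_distance', 'total_amount']]
```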

In [19]:
#given this series, what can we calculate on it?
#Pandas has defined a lot of methods that come in handy

In [22]:
taxi_df['trip_distance'].min()

0.0

In [24]:
taxi_df['trip_distance'].max()

64.6

In [25]:
taxi_df['trip_distance'].mean()

3.1585108510851083

In [26]:
#A way to get a data summary

taxi_df['trip_distance'].describe()

count    9999.000000
mean        3.158511
std         4.037516
min         0.000000
25%         1.000000
50%         1.700000
75%         3.300000
max        64.600000
Name: trip_distance, dtype: float64
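The result of 'describe' is itself a series, so individual statistics can be pulled out by label. A minimal sketch with illustrative values:

```python
import pandas as pd

# Illustrative values, not the full taxi column.
distances = pd.Series([1.63, 0.46, 0.87, 2.13, 1.40], name='trip_distance')

summary = distances.describe()

# describe() returns a series, so each statistic is retrieved by its label.
smallest = summary['min']
median = summary['50%']        # the 50% quantile is the median
row_count = summary['count']   # number of non-missing values, as a float
```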

In [27]:
taxi_df['trip_distance'].tail()

9994    2.70
9995    4.50
9996    5.59
9997    1.54
9998    5.80
Name: trip_distance, dtype: float64

# Jupyter has magic commands

Jupyter adds a whole set of its own "magic" commands, and they start with '%' so that they can't be confused with Python code. One of them is '%who', which gives you a list of variables. You'll probably prefer '%whos', which gives you a little table of variable names, types, and values.

You can find out what magic commands are available with %magic.

In [32]:
%whos

Variable   Type         Data/Info
---------------------------------
filename   str          ../data/taxi.csv
pd         module       <module 'pandas' from '/h<...>ages/pandas/__init__.py'>
taxi_df    DataFrame          VendorID tpep_picku<...>n[9999 rows x 19 columns]


In [29]:
del df #remove the variable name df (it must already be defined); when the number of references to the data frame goes to 0, the memory is freed up
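The point about references can be demonstrated with a small sketch. It uses a hypothetical 'Payload' class as a stand-in for a data frame and a weak reference to observe when the object goes away. The immediate cleanup shown here relies on CPython's reference counting; other Python implementations may free the object later.

```python
import weakref

class Payload:
    """Hypothetical stand-in for a large object such as a data frame."""

first = Payload()
second = first              # a second name for the same object
probe = weakref.ref(first)  # lets us watch the object without keeping it alive

del first                   # removes one name; the object still has a reference
still_alive = probe() is not None

del second                  # removes the last reference
freed = probe() is None     # in CPython, the object is gone at this point
```

So 'del' removes a name, not the data itself; the memory is only released once no names (or other references) point to the object.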

EXERCISE: ANALYSIS OF TOTAL AMOUNT

Our taxi file has a 'total_amount' column, indicating the total amount that the person needs to pay. I want you to:

- Get the first 5 values. What is the mean of the first 5 values in the column?
- Run describe on that column. How are the 