In [1]:
print('Hello, world!')   # if I type into Jupyter, it's considered to be Python code
 
    
# I can type whatever I want
# when I want to execute the cell, I use shift+Enter

Hello, world!


# Exercise: Set up a new notebook

1. Go to https://pharo.lerner.co.il
2. Navigate to the "Monday group" folder
3. Create a new Python 3 notebook
4. Click on the title, and rename it to reflect your name and the date
5. Write a trivial piece of Python code, and execute it with shift+Enter

# Very fast intro to Jupyter

Everything in Jupyter happens in a "cell." I'm typing into a cell right now. When we type into a cell, it can be in one of two modes:

- Edit mode (like right now), where text goes into the cell. I can enter edit mode by clicking on the cell's contents or by pressing ENTER. This allows me to type.
- Command mode, where typing tells Jupyter what to do, typically with one-character commands. You can enter command mode by clicking to the left of the cell or by pressing ESC.

What commands do I have in command mode?
- `c` -- copy the current cell
- `x` -- cut the current cell
- `v` -- paste the most recently cut/copied cell
- `a` -- create a new cell *above* the current one
- `b` -- create a new cell *below* the current one
- `y` -- turn the current cell into a Python code cell
- `m` -- turn the current cell into a Markdown text cell (like this one)

In [2]:
x = 100
y = 200

print(x+y)

300


In [3]:
x

100

In [4]:
y

200

In [5]:
# if the final line of a Python cell is an expression, and if its value isn't None,
# then Jupyter displays that value as the cell's output, without any need for print

In [6]:
x * y

20000
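
To see why the "non-None" part of that rule matters, here is a minimal sketch in plain Python (so it runs outside of Jupyter, too). The names `result` and `product` are just ones I made up for illustration. The point is that `print` displays its argument as a side effect but returns `None`, which is why a cell ending in a call to `print` shows printed text but no displayed value.

```python
x = 100
y = 200

# print displays 300 as a side effect, but the call itself returns None
result = print(x + y)
print(result is None)    # True

# assignment is a statement, not an expression, so a cell ending with
# this line would display nothing; a final line of just "product" would
product = x * y
print(product)           # 20000
```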

# Syllabus

1. Intro to analytics with Python
    - What is Pandas?
    - Basics of reading data 
    - Analyzing data
    - Visualizing data
2. Pandas series (1-dimensional data)
    - Analyzing with a series
    - Retrieving from a series
    - Broadcasting
    - What *not* to do if you're a good Python programmer!
3. Mask arrays
    - Selective retrieval
    - This is what we use instead of loops in Pandas
4. Indexes
    - Retrieving in various ways
5. Dtypes -- data types behind the scenes
    - What are they?
    - Changing/converting dtypes
    - `NaN` ("not a number")
6. Reading data from a file 
    - CSV files
    - Retrieving rows
    - Retrieving columns (data frame -- 2D data in Pandas)
7. Different data sources
    - Excel
    - JSON
    - Grabbing data from APIs
    - From Web pages
8. Sorting
9. Grouping and pivot tables
10. Cleaning data
11. Working with text
12. Dates and times
13. Visualization
    - Plotting
    - Charts


# What is Pandas?

Isn't Python a terrible language for doing data analysis?

The secret is that yes, it is: if we were to use Python's built-in integers and floats for our data analysis, we would quickly run out of memory, and our programs would run so slowly that we would happily be charging by the hour.

A number of years ago, people invented NumPy, which puts a thin layer of Python on top of data structures implemented in C. You get the advantage of programming in Python, but with the speed and memory efficiency of C.
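
Here is a rough sketch of the memory difference, assuming NumPy is installed. The variable names are mine, and the exact byte count for the Python list depends on your Python version and platform, so I only compare the two totals rather than promise a specific number for the list.

```python
import sys
import numpy as np

n = 1_000
py_numbers = list(range(n))                  # 1,000 separate Python int objects
np_numbers = np.arange(n, dtype=np.int64)    # 1,000 raw 64-bit integers, stored in C

# the list stores a pointer per element, plus a full Python object for each int
list_bytes = sys.getsizeof(py_numbers) + sum(sys.getsizeof(i) for i in py_numbers)

# the array stores 8 bytes per element, and nothing else per element
array_bytes = np_numbers.nbytes

print(array_bytes)                # 8000
print(list_bytes > array_bytes)   # True
```

Both hold the same thousand numbers; the array just stores them far more compactly.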

Pandas is a Python package that acts as a wrapper around NumPy, making it easier and friendlier to use, and adding more functionality.

Pandas allows us to:
- Read data from a variety of formats and sources
- Clean the data
- Analyze the data in numerous ways
- Write our analysis in a number of formats and outputs
- Visualize our data using plotting and charting libraries

Pandas tries to centralize all of this. We could use Matplotlib, a popular plotting library in Python -- but we're going to concentrate on using the Pandas wrappers for Matplotlib.

# Using Pandas

If I want to do analysis with Pandas, I have to download and install it from PyPI. That's easily done with

    pip install pandas
    
In our program, we'll want to say

    import pandas as pd

In [7]:
import pandas as pd

In [8]:
# what version are we running?

pd.__version__

'1.5.3'

In [10]:
# let's say that I want to load data

# most people work with CSV files -- comma-separated values (or "character separated values")
# - every line of the file is a record
# - every line contains multiple fields, separated by a comma or other character (e.g., \t)

# we'll read our data from a CSV file into a Pandas "data frame," meaning a 2D table
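
To make the "comma or other character" point concrete, here is a small sketch that doesn't need any file on disk. The two strings are tiny, made-up records (not part of the course data), and `io.StringIO` lets `pd.read_csv` treat a string as if it were a file. The `sep` parameter tells `read_csv` which character separates the fields.

```python
import io
import pandas as pd

# the same made-up records, once separated by commas and once by tabs
comma_text = "name,distance\nfirst,1.5\nsecond,2.25\n"
tab_text = "name\tdistance\nfirst\t1.5\nsecond\t2.25\n"

comma_df = pd.read_csv(io.StringIO(comma_text))          # comma is the default separator
tab_df = pd.read_csv(io.StringIO(tab_text), sep='\t')    # tell read_csv to split on tabs

print(comma_df.shape)             # (2, 2) -- two records, two fields each
print(comma_df.equals(tab_df))    # True -- same data, different separator
```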

In [11]:
!ls ../data/*.csv

../data/2020_sharing_data_outside.csv  ../data/olympic_athlete_events.csv
../data/CPILFESL.csv		       ../data/san+francisco,ca.csv
../data/albany,ny.csv		       ../data/sat-scores.csv
../data/boston,ma.csv		       ../data/skyscrapers.csv
../data/burrito_current.csv	       ../data/springfield,il.csv
../data/celebrity_deaths_2016.csv      ../data/springfield,ma.csv
../data/chicago,il.csv		       ../data/taxi-distance.csv
../data/eu_cpi.csv		       ../data/taxi-passenger-count.csv
../data/eu_gdp.csv		       ../data/taxi.csv
../data/ice-cream.csv		       ../data/titanic3.csv
../data/languages.csv		       ../data/us-median-cpi.csv
../data/los+angeles,ca.csv	       ../data/us-unemployment-rate.csv
../data/miles-traveled.csv	       ../data/us_gdp.csv
../data/new+york,ny.csv		       ../data/winemag-150k-reviews.csv
../data/oecd_locations.csv	       ../data/wti-daily.csv
../data/oecd_tourism.csv


In [12]:
# let's take the CSV file for taxi data and turn it into a data frame
# the read_csv function in the pandas module is our workhorse for reading CSV files

filename = '../data/taxi.csv'
df = pd.read_csv(filename)

In [13]:
# now, let's look at the data frame
df

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
0,2,2015-06-02 11:19:29,2015-06-02 11:47:52,1,1.63,-73.954430,40.764141,1,N,-73.974754,40.754093,2,17.0,0.0,0.5,0.00,0.0,0.3,17.80
1,2,2015-06-02 11:19:30,2015-06-02 11:27:56,1,0.46,-73.971443,40.758942,1,N,-73.978539,40.761909,1,6.5,0.0,0.5,1.00,0.0,0.3,8.30
2,2,2015-06-02 11:19:31,2015-06-02 11:30:30,1,0.87,-73.978111,40.738434,1,N,-73.990273,40.745438,1,8.0,0.0,0.5,2.20,0.0,0.3,11.00
3,2,2015-06-02 11:19:31,2015-06-02 11:39:02,1,2.13,-73.945892,40.773529,1,N,-73.971527,40.760330,1,13.5,0.0,0.5,2.86,0.0,0.3,17.16
4,1,2015-06-02 11:19:32,2015-06-02 11:32:49,1,1.40,-73.979088,40.776772,1,N,-73.982162,40.758999,2,9.5,0.0,0.5,0.00,0.0,0.3,10.30
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9994,1,2015-06-01 00:12:59,2015-06-01 00:24:18,1,2.70,-73.947792,40.814972,1,N,-73.973358,40.783638,2,11.0,0.5,0.5,0.00,0.0,0.3,12.30
9995,1,2015-06-01 00:12:59,2015-06-01 00:28:16,1,4.50,-74.004066,40.747818,1,N,-73.953758,40.779285,1,16.0,0.5,0.5,3.00,0.0,0.3,20.30
9996,2,2015-06-01 00:13:00,2015-06-01 00:37:25,1,5.59,-73.994377,40.766102,1,N,-73.903206,40.750546,2,21.0,0.5,0.5,0.00,0.0,0.3,22.30
9997,2,2015-06-01 00:13:02,2015-06-01 00:19:10,6,1.54,-73.978302,40.748531,1,N,-73.989166,40.762852,2,6.5,0.5,0.5,0.00,0.0,0.3,7.80


In [14]:
# now what? Let's grab one column
# we can do that by naming our variable (df) and naming the column with a string inside of []
# you can think of a data frame, behind the scenes, as being a dict of Pandas series (1D data)

df['trip_distance']   # this will retrieve the trip_distance column from our data frame

0       1.63
1       0.46
2       0.87
3       2.13
4       1.40
        ... 
9994    2.70
9995    4.50
9996    5.59
9997    1.54
9998    5.80
Name: trip_distance, Length: 9999, dtype: float64
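
The "dict of series" idea can be sketched by building a tiny data frame by hand. This is only an illustration: I typed in the first three `trip_distance` and `passenger_count` values from the output above, and the names `distance`, `passengers`, and `small_df` are my own.

```python
import pandas as pd

# two series, typed in by hand from the first three rows shown above
distance = pd.Series([1.63, 0.46, 0.87])
passengers = pd.Series([1, 1, 1])

# passing a dict of series to DataFrame gives us one column per key
small_df = pd.DataFrame({'trip_distance': distance,
                         'passenger_count': passengers})

# retrieving with [] and a column name gives us back a series
column = small_df['trip_distance']
print(type(column).__name__)     # Series
print(list(small_df.keys()))     # ['trip_distance', 'passenger_count']
```

Just as with a dict, the keys are the column names, and each value is a series.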

In [15]:
# the CSV file we read from had 10,000 lines
# the first one contained the column names
# the rest contained data

# what can I do with this series?
# I can run various methods on it to analyze the data

# I can ask what the minimum value was
df['trip_distance'].min()

0.0
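
`min` is just one of many such methods. Here is a short sketch of a few others, run on a hand-typed series containing the first five `trip_distance` values shown earlier, so the results here describe only those five values and not the full taxi data set.

```python
import pandas as pd

# the first five trip_distance values from the output above, typed in by hand
s = pd.Series([1.63, 0.46, 0.87, 2.13, 1.40], name='trip_distance')

print(s.min())        # 0.46
print(s.max())        # 2.13
print(s.mean())       # the average of the five values
print(s.describe())   # count, mean, std, min, quartiles, and max in one call
```

On the real data, you would call the same methods on `df['trip_distance']`.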