Yellow Cab Chicago has hired your small accounting and consulting firm to help it reduce costs and increase revenues. Like many traditional cab companies, Yellow Cab is feeling significant pressure from ride-hailing companies like Lyft and Uber. You are a recent graduate, new to the firm, and excited to learn that you will have a role in this engagement.

## Load and Look at the Data

Accompanying the case is a CSV file containing 2,000,000 rows of actual cab data from the City of Chicago. Load the data from the CSV file into a `DataFrame` using the pandas package. Look at the columns and their data types. How many rows and columns does the `DataFrame` have? Do any of the `DataFrame` columns have null values?
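The steps above might be sketched as follows. This is a minimal sketch, not the case solution: the tiny in-memory CSV and its column names (`trip_id`, `trip_seconds`, `trip_miles`, `fare`, `company`) are hypothetical stand-ins for the real file, whose name and schema are not given here.

```python
import io

import pandas as pd

# Hypothetical stand-in for the case file. With the real data you would
# pass the CSV path to pd.read_csv instead of this in-memory buffer.
sample_csv = io.StringIO(
    "trip_id,trip_seconds,trip_miles,fare,company\n"
    "a1,600,2.5,9.75,Yellow Cab\n"
    "a2,,1.1,6.25,Yellow Cab\n"
    "a3,900,,12.50,\n"
)

df = pd.read_csv(sample_csv)

# Column names and their inferred data types
print(df.dtypes)

# Number of rows and columns
n_rows, n_cols = df.shape
print(f"{n_rows} rows, {n_cols} columns")

# Count of null values per column, showing only columns that have any
null_counts = df.isnull().sum()
print(null_counts[null_counts > 0])
```

On a file with millions of rows, `df.info()` gives a similar summary (dtypes and non-null counts) in a single call.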

<center><img src="images/rocket.png" alt="Rocket" /></center>

### Testing Code Cells

As with any real data, our table contains mistakes and omissions. Let’s start by removing null values from key columns of interest. We will not remove every row that contains a null value, because that would eliminate too much data. Rather, we will think strategically about which variables (or columns) we really care about, and eliminate the rows that have nulls in those columns.
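One way to do this in pandas is `DataFrame.dropna` with the `subset` parameter, which drops a row only when one of the listed columns is null. The sketch below uses a small made-up table, and the choice of key columns is an assumption for illustration, not the set the case requires.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with nulls scattered across columns
trips = pd.DataFrame({
    "trip_seconds": [600, np.nan, 900, 300],
    "trip_miles": [2.5, 1.1, np.nan, 0.8],
    "fare": [9.75, 6.25, 12.50, 5.00],
    "company": ["Yellow Cab", None, "Yellow Cab", None],
})

# Assumed columns of interest; nulls elsewhere (e.g. company) are tolerated
key_columns = ["trip_seconds", "trip_miles", "fare"]

# Drop rows only where a key column is null
cleaned = trips.dropna(subset=key_columns)

# Compare with dropping every row that has any null at all
dropped_all = trips.dropna()

print(len(trips), len(cleaned), len(dropped_all))
```

Here the targeted drop keeps a row whose only null is in `company`, while the blanket `dropna()` would discard it.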

In [1]:
import matplotlib.pyplot as plt

friends = [ 70,  65,  72,  63,  71,  64,  60,  64,  67]
minutes = [175, 170, 205, 120, 220, 130, 105, 145, 190]
labels =  ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']

plt.scatter(friends, minutes)

# label each point
for label, friend_count, minute_count in zip(labels, friends, minutes):
    plt.annotate(label,
        xy=(friend_count, minute_count), # Put the label with its point
        xytext=(5, -5),                  # but slightly offset
        textcoords='offset points')

plt.title("Daily Minutes vs. Number of Friends")
plt.xlabel("# of friends")
plt.ylabel("daily minutes spent on the site")
plt.show()

<Figure size 640x480 with 1 Axes>

But how about NumPy arrays?

In [2]:
import numpy as np

numpy_array = np.random.rand(6, 12)
numpy_array

array([[0.91641824, 0.54310168, 0.8091056 , 0.88245616, 0.91964695,
        0.91648015, 0.63045884, 0.29604678, 0.19678313, 0.39994041,
        0.38253659, 0.57745387],
       [0.73317156, 0.89589541, 0.25502668, 0.08824095, 0.97800119,
        0.358111  , 0.56699541, 0.14422642, 0.14421129, 0.07024278,
        0.564233  , 0.84338071],
       [0.34844971, 0.46429215, 0.78501114, 0.53640208, 0.69418632,
        0.51274316, 0.88041439, 0.90257281, 0.23093572, 0.62138988,
        0.64943243, 0.78328884],
       [0.73316043, 0.26546137, 0.86807861, 0.8238116 , 0.46813309,
        0.04077674, 0.90260839, 0.29770408, 0.93569717, 0.26011497,
        0.07114519, 0.71901327],
       [0.66877959, 0.40267311, 0.31362379, 0.39005604, 0.14184502,
        0.88909397, 0.61879475, 0.78452009, 0.88572798, 0.63119705,
        0.39234353, 0.57453366],
       [0.6995472 , 0.37564836, 0.97052624, 0.02683573, 0.0733955 ,
        0.76443673, 0.67437029, 0.34452404, 0.27337582, 0.6010999 ,
        0.12309178,

Alright, let's see how my blog handles syntax highlighting.

In [3]:
# Program to count the occurrences of each vowel

# string of vowels
vowels = 'aeiou'

ip_str = 'Hello, have you tried our tutorial section yet?'

# make it suitable for caseless comparisons
ip_str = ip_str.casefold()

# make a dictionary with each vowel as a key and 0 as its value
count = dict.fromkeys(vowels, 0)

# count the vowels
for char in ip_str:
    if char in count:
        count[char] += 1

print(count)

{'a': 2, 'e': 5, 'i': 3, 'o': 5, 'u': 3}


Simple table

In [4]:
import seaborn as sns

df_titanic = sns.load_dataset('titanic')
df_titanic.head()

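|   | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True |  | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False |  | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True |  | Southampton | no | True |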


In [5]:
df_iris = sns.load_dataset('iris')
df_iris.head()

|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


Brain networks dataset

In [6]:
df_brain_networks = sns.load_dataset('brain_networks')
df_brain_networks

Unnamed: 0,network,1,1.1,2,2.1,3,3.1,4,4.1,5,...,16.5,16.6,16.7,17,17.1,17.2,17.3,17.4,17.5,17.6
0,node,1,1,1,1,1,1,1,1,1,...,3,4,4,1,1,2,2,3,3,4
1,hemi,lh,rh,lh,rh,lh,rh,lh,rh,lh,...,rh,lh,rh,lh,rh,lh,rh,lh,rh,lh
2,,,,,,,,,,,...,,,,,,,,,,
3,0,56.05574417114258,92.03103637695312,3.391575574874878,38.65968322753906,26.203819274902344,-49.71556854248047,47.4610366821289,26.746612548828125,-35.898860931396484,...,0.6079040169715881,-70.27054595947266,77.36577606201172,-21.73455047607422,1.0282527208328247,7.7917842864990225,68.90372467041016,-10.520872116088867,120.49046325683594,-39.686431884765625
4,1,55.5472526550293,43.6900749206543,-65.49598693847656,-13.974522590637207,-28.27496337890625,-39.05012893676758,-1.2106596231460571,-19.012897491455078,19.568010330200195,...,57.49507141113281,-76.39321899414062,127.26136016845705,-13.035799026489258,46.3818244934082,-15.752449989318848,31.00033187866211,-39.607521057128906,24.76401138305664,-36.7710075378418
5,2,60.99776840209961,63.43879318237305,-51.10858154296875,-13.561346054077148,-18.842947006225586,-1.2146592140197754,-65.5758056640625,-85.77742767333984,19.247453689575195,...,28.31736946105957,9.063977241516113,45.4932632446289,26.0334415435791,34.21220016479492,1.326109766960144,-22.58075714111328,12.985169410705565,-75.02745056152344,6.434262275695801
6,3,18.514867782592773,12.65715789794922,-34.576602935791016,-32.665958404541016,-7.420454025268555,17.119447708129886,-41.80086898803711,-58.61018371582031,32.896915435791016,...,71.43962860107422,65.84297943115234,-10.69754695892334,55.29746627807617,4.2550063133239755,-2.420143842697144,12.098393440246582,-15.819171905517578,-37.36143112182617,-4.650953769683838
7,4,-2.5273923873901367,-63.10466766357422,-13.8141508102417,-15.83798885345459,-45.21692657470703,3.4835495948791504,-62.61333465576172,-49.07650756835938,18.396759033203125,...,95.59756469726562,50.96045303344727,-23.19729995727539,43.06756210327149,52.21987533569336,28.23288154602051,-11.71975040435791,5.453648567199707,5.169828414916992,87.80913543701172
8,5,-24.90679168701172,-51.19189453125,-29.86799430847168,-27.840293884277344,-24.45510673522949,47.115760803222656,-48.46282196044922,-35.40941619873047,-15.90056610107422,...,71.70707702636719,108.20982360839844,-38.98595428466797,56.43561553955078,11.073355674743652,22.7122859954834,-24.315147399902344,11.061019897460938,-51.8962516784668,63.12318420410156
9,6,17.273710250854492,0.5400829315185547,18.649370193481445,-9.105488777160645,-2.1172263622283936,87.95771026611328,-16.89242172241211,-30.35905265808105,31.080501556396484,...,21.22088050842285,112.86585235595705,-11.026081085205078,43.089622497558594,18.862913131713867,58.88899993896485,-48.4278450012207,-1.9686769247055051,-88.8005599975586,79.81661224365234


## Other elements in Jupyter notebook

- I recommend that you download and install the Anaconda distribution of Python. This is a premier distribution of Python designed for data science. It includes a Jupyter notebook server and JupyterLab.
- If you want to work in the cloud, you can try Microsoft Azure Notebooks. This is a free service from Microsoft that you can use to study the following:
    * Statistics
    * Calculus
    * Data Analytics