# Introduction to Python and Jupyter Notebooks Review

To begin, make sure you can move between cells in a Jupyter notebook and switch a cell between code and Markdown. For more practice styling Markdown cells, see the [cheatsheet](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet). In this part of the notebook, we review some NumPy basics and create some simple plots with Matplotlib.

In [1]:
%%HTML
<iframe width="560" height="315" src="https://www.youtube.com/embed/T8JGn4JRy4g?ecver=1" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>

In [2]:
%matplotlib notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

### NumPy and Matplotlib

To begin, let's play with some basic `matplotlib` plots and the NumPy random methods. For more information, see the [NumPy random sampling documentation](https://docs.scipy.org/doc/numpy-1.14.0/reference/routines.random.html).
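The draws in the cells that follow change on every run. As a minimal sketch, assuming NumPy 1.17 or newer, a seeded `Generator` makes them repeatable. The seed `0` and the variable names are my own choices for illustration, not part of the notebook. Note that the upper bound is excluded here, just as it is in `np.random.randint(1, 20, 100)`.

```python
import numpy as np

# Seeded generator so the draws are repeatable (seed 0 is an arbitrary choice).
rng = np.random.default_rng(0)

ints = rng.integers(1, 20, size=100)       # integers from 1 to 19; the upper bound is excluded
floats = rng.random(100)                   # uniform floats in [0, 1)
normal = rng.normal(5, 10, size=100)       # mean 5, standard deviation 10
binom = rng.binomial(100, 0.3, size=100)   # successes in 100 trials with p = 0.3

print(ints.min(), ints.max())
print(floats.min(), floats.max())
```

Re-running the cell with the same seed should reproduce the same arrays, which helps when comparing histograms across sessions.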

In [3]:
a = np.random.randint(1, 20, 100)

In [40]:
plt.figure()
plt.hist(a)


(array([12., 10., 11.,  8.,  6.,  9., 15., 10.,  7., 12.]),
 array([ 1. ,  2.8,  4.6,  6.4,  8.2, 10. , 11.8, 13.6, 15.4, 17.2, 19. ]),
 <a list of 10 Patch objects>)

In [5]:
b = np.random.random(100)
c = np.random.normal(5, 10, 100)
d = np.random.binomial(100, .3, 100)

In [41]:
plt.hist(b)

(array([12., 11.,  8.,  7.,  9.,  9., 14., 10., 11.,  9.]),
 array([0.01461011, 0.11299189, 0.21137367, 0.30975544, 0.40813722,
        0.506519  , 0.60490077, 0.70328255, 0.80166433, 0.9000461 ,
        0.99842788]),
 <a list of 10 Patch objects>)

In [6]:
np.random.binomial?

In [7]:
a[:5]

array([17, 19,  7,  2, 17])

In [43]:
plt.figure(figsize = (9, 6))

plt.subplot(2, 2, 1)
plt.hist(a)
plt.title("Random Integers")

plt.subplot(2, 2, 2)
plt.hist(b, color = 'green')
plt.title("Random Floats")

plt.subplot(2, 2, 3)
plt.hist(c, color = 'grey')
plt.title("Normal Distribution")

plt.subplot(2, 2, 4)
plt.hist(d, color = 'orange')
plt.title("Binomial Distribution")


Text(0.5,1,'Binomial Distribution')
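The same 2 x 2 grid can also be built with Matplotlib's object-oriented interface. This is a sketch under my own assumptions: the data is regenerated from a seeded generator, and the `Agg` backend line is only there so the code runs outside a notebook (drop it when using `%matplotlib notebook`).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs outside a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed, for repeatability only
samples = {
    "Random Integers": rng.integers(1, 20, size=100),
    "Random Floats": rng.random(100),
    "Normal Distribution": rng.normal(5, 10, size=100),
    "Binomial Distribution": rng.binomial(100, 0.3, size=100),
}

# One Figure holding a 2x2 grid of Axes; each histogram is drawn on its own Axes.
fig, axes = plt.subplots(2, 2, figsize=(9, 6))
for ax, (title, data) in zip(axes.flat, samples.items()):
    ax.hist(data)
    ax.set_title(title)

fig.tight_layout()  # keep titles from overlapping the plots above them
```

Working with explicit `Axes` objects avoids relying on whichever subplot happens to be "current", which tends to matter once a figure has more than a couple of panels.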

In [42]:
plt.figure()
plt.scatter(c, d)
plt.title("Scatter Plot", loc = 'left')
plt.xticks([])
plt.yticks([])


([], <a list of 0 Text yticklabel objects>)

In [44]:
dists = [a, b, c, d]
plt.figure()
plt.boxplot(dists)
plt.title("Boxplots of Distributions", loc = "right")


Text(1,1,'Boxplots of Distributions')

In [45]:
import seaborn as sns

plt.figure()
for i in [a,c,d]:
    sns.kdeplot(i)  # distplot(hist=False) is deprecated; kdeplot draws the density curve


### Loading Data: Intro to Pandas

Now we use the Pandas library to examine a variety of datasets. Below, I create four different `DataFrame` objects. The first three come from `.csv` files located in our **data** directory. The last comes through the NYC OpenData API. We will keep returning to ways of accessing and structuring data, but to begin we use these two popular options.

To load a `.csv` file, we pass Pandas a path or URL in the `pd.read_csv()` function; the API data, which arrives as JSON, is loaded with `pd.read_json()`. I load all four datasets in what follows.
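To see the two loaders side by side without touching the network or the file system, here is a small sketch that reads from in-memory text. The CSV and JSON contents are made up for illustration; only `pd.read_csv` and `pd.read_json` themselves come from the notebook.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file in the data directory.
csv_text = "complaint,borough,count\nnoise,QUEENS,3\nparking,BROOKLYN,5\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

# Hypothetical JSON records, shaped like the list of objects an open-data API returns.
json_text = '[{"borough": "QUEENS", "count": 3}, {"borough": "BROOKLYN", "count": 5}]'
from_json = pd.read_json(io.StringIO(json_text))

print(from_csv.dtypes)
print(from_json.head())
```

Both calls accept a path, a URL, or a file-like object, so swapping the `StringIO` buffer for a real location should not require other changes.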

In [12]:
%%HTML
<iframe width="560" height="315" src="https://www.youtube.com/embed/9Dsg9DQAU_g?ecver=1" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>

In [13]:
nyc311data = pd.read_json('https://data.cityofnewyork.us/resource/fhrw-4uyv.json')

In [14]:
nyc311data.columns

Index(['address_type', 'agency', 'agency_name', 'bbl', 'borough', 'city',
       'closed_date', 'community_board', 'complaint_type', 'created_date',
       'cross_street_1', 'cross_street_2', 'descriptor', 'due_date',
       'facility_type', 'incident_address', 'incident_zip',
       'intersection_street_1', 'intersection_street_2', 'latitude',
       'location', 'location_type', 'longitude', 'open_data_channel_type',
       'park_borough', 'park_facility_name', 'resolution_action_updated_date',
       'resolution_description', 'status', 'street_name',
       'taxi_pick_up_location', 'unique_key', 'x_coordinate_state_plane',
       'y_coordinate_state_plane'],
      dtype='object')

In [15]:
nyc311data.dtypes

address_type                       object
agency                             object
agency_name                        object
bbl                               float64
borough                            object
city                               object
closed_date                        object
community_board                    object
complaint_type                     object
created_date                       object
cross_street_1                     object
cross_street_2                     object
descriptor                         object
due_date                           object
facility_type                      object
incident_address                   object
incident_zip                      float64
intersection_street_1              object
intersection_street_2              object
latitude                          float64
location                           object
location_type                      object
longitude                         float64
open_data_channel_type            

In [16]:
nyc311data.describe()

Unnamed: 0,bbl,incident_zip,latitude,longitude,unique_key,x_coordinate_state_plane,y_coordinate_state_plane
count,788.0,994.0,992.0,992.0,1000.0,992.0,992.0
mean,2821326000.0,10871.207243,40.733111,-73.914738,39546780.0,1007873.0,206390.224798
std,1204321000.0,546.764302,0.082205,0.079548,2405.281,22059.89,29947.421808
min,1000330000.0,10001.0,40.511559,-74.242685,39542680.0,916768.0,125745.0
25%,2028605000.0,10453.0,40.675169,-73.956432,39544620.0,996336.0,185261.0
50%,3026485000.0,11209.5,40.719661,-73.92055,39546700.0,1006240.0,201471.0
75%,4016903000.0,11364.75,40.801983,-73.866021,39548840.0,1021327.0,231463.25
max,5073550000.0,11694.0,40.907142,-73.729944,39550840.0,1059177.0,269790.0


In [17]:
complaints = nyc311data[['complaint_type', 'borough', 'agency', 'agency_name']]

In [18]:
complaints.head()

Unnamed: 0,complaint_type,borough,agency,agency_name
0,Request Large Bulky Item Collection,BROOKLYN,DSNY,Department of Sanitation
1,Request Large Bulky Item Collection,MANHATTAN,DSNY,Department of Sanitation
2,Noise - Residential,QUEENS,NYPD,New York City Police Department
3,Noise - Residential,QUEENS,NYPD,New York City Police Department
4,Noise - Residential,QUEENS,NYPD,New York City Police Department


In [19]:
complaints.groupby(by = 'borough').size()

borough
BRONX            156
BROOKLYN         298
MANHATTAN        200
QUEENS           299
STATEN ISLAND     42
Unspecified        5
dtype: int64

In [46]:
a = complaints.groupby(by = 'borough').size()

In [47]:
a

borough
BRONX            156
BROOKLYN         298
MANHATTAN        200
QUEENS           299
STATEN ISLAND     42
Unspecified        5
dtype: int64

In [48]:
a.iloc[0]  # positional lookup; plain a[0] on a label-indexed Series is deprecated in recent pandas

156

In [50]:
a.index


Index(['BRONX', 'BROOKLYN', 'MANHATTAN', 'QUEENS', 'STATEN ISLAND',
       'Unspecified'],
      dtype='object', name='borough')
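The grouped counts form a `Series` indexed by borough name, so entries can be looked up by label or by position. A minimal sketch on a made-up miniature of the complaints table (the rows and the names `toy`, `counts`, and `ranked` are hypothetical):

```python
import pandas as pd

# Hypothetical miniature of the complaints table.
toy = pd.DataFrame({
    "borough": ["QUEENS", "BRONX", "QUEENS", "BROOKLYN", "QUEENS"],
    "complaint_type": ["Noise", "Noise", "Parking", "Noise", "Noise"],
})

counts = toy.groupby("borough").size()    # index sorted alphabetically by group key
ranked = toy["borough"].value_counts()    # same counts, sorted largest first

print(counts["QUEENS"])    # label-based lookup
print(counts.iloc[0])      # position-based lookup: the first label alphabetically, BRONX
print(ranked.index[0])     # the most frequent borough
```

`groupby(...).size()` and `value_counts()` carry the same information; the second is convenient when the goal is a ranking, as in the Brooklyn bar charts.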

In [54]:
complaints.iloc[133]

complaint_type                Noise - Residential
borough                                  BROOKLYN
agency                                       NYPD
agency_name       New York City Police Department
Name: 133, dtype: object

In [20]:
complaints[complaints['borough'] =='BROOKLYN'].sort_values('complaint_type')[:10]

Unnamed: 0,complaint_type,borough,agency,agency_name
79,Blocked Driveway,BROOKLYN,NYPD,New York City Police Department
554,Blocked Driveway,BROOKLYN,NYPD,New York City Police Department
386,Blocked Driveway,BROOKLYN,NYPD,New York City Police Department
191,Blocked Driveway,BROOKLYN,NYPD,New York City Police Department
931,Blocked Driveway,BROOKLYN,NYPD,New York City Police Department
284,Blocked Driveway,BROOKLYN,NYPD,New York City Police Department
272,Blocked Driveway,BROOKLYN,NYPD,New York City Police Department
392,Blocked Driveway,BROOKLYN,NYPD,New York City Police Department
448,Blocked Driveway,BROOKLYN,NYPD,New York City Police Department
712,Blocked Driveway,BROOKLYN,NYPD,New York City Police Department


In [51]:
BK_COMPLAIN = complaints[complaints['borough'] == 'BROOKLYN']['complaint_type'].value_counts()

In [49]:
plt.figure(figsize = (7, 5))
plt.bar(BK_COMPLAIN.index[:6], BK_COMPLAIN[:6])


<Container object of 6 artists>

In [55]:
plt.tick_params(labelrotation = 20)

In [56]:
plt.rcParams["font.family"] = "fantasy"

plt.figure(figsize = (10, 7))
bars = plt.barh(BK_COMPLAIN.index[:5], BK_COMPLAIN[:5])
plt.title("Top 5 311 Complaints in Brooklyn", loc = 'left', fontsize = 16 )


Text(0,1,'Top 5 311 Complaints in Brooklyn')

In [25]:
labels = BK_COMPLAIN.index

In [26]:
for i in labels[:6]:
    print(i)

Noise - Residential
Noise - Street/Sidewalk
Noise - Commercial
Blocked Driveway
Illegal Parking
Noise - Vehicle


In [57]:
for i in range(5):
    label = labels[i]
    plt.gca().text(2, i, label, color = 'w', fontsize = 10)

In [58]:
plt.tick_params(top = False, bottom = False, left = False, right = False, labelleft = False, labelbottom = False)

In [59]:
for spine in plt.gca().spines.values():
    spine.set_visible(False)
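The styling steps above (labels written inside the bars, ticks hidden, spines removed) are spread over several cells. Here they are collected into one self-contained sketch. The counts are invented stand-ins for `BK_COMPLAIN`, and the `Agg` backend line is only there so it runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside the notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical counts standing in for the Brooklyn complaint totals.
top = pd.Series({"Noise - Residential": 40, "Blocked Driveway": 25, "Illegal Parking": 18})

fig, ax = plt.subplots(figsize=(10, 7))
ax.barh(top.index, top.values)
ax.set_title("Top complaints (toy data)", loc="left", fontsize=16)

# Write each category name inside its bar instead of on the y-axis.
for position, label in enumerate(top.index):
    ax.text(2, position, label, color="w", fontsize=10, va="center")

# Booleans switch tick marks and tick labels on or off.
ax.tick_params(top=False, bottom=False, left=False, right=False,
               labelleft=False, labelbottom=False)

# Remove the box around the plot.
for spine in ax.spines.values():
    spine.set_visible(False)
```

Keeping a handle on `ax` means every styling call targets the same plot, even if another figure is created in between.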

In [30]:
plt.savefig('images/brooklyn_complaining.png')

In [60]:
nums = BK_COMPLAIN

### Titanic Manipulation

In [31]:
titanic = pd.read_csv('data/eda_data/titanic.csv')
titanic.head()

Unnamed: 0,survived,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
0,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [32]:
titanic[titanic.pclass == 3][:5]

Unnamed: 0,survived,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
0,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
7,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S


In [33]:
titanic.sample(frac=0.1)[:5]

Unnamed: 0,survived,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
331,0,1,"Partner, Mr. Austen",male,45.5,0,0,113043,28.5,C124,S
408,0,3,"Birkeland, Mr. Hans Martin Monsen",male,21.0,0,0,312992,7.775,,S
235,0,3,"Harknett, Miss. Alice Phoebe",female,,0,0,W./C. 6609,7.55,,S
836,0,3,"Pasic, Mr. Jakob",male,21.0,0,0,315097,8.6625,,S
868,0,3,"van Melkebeke, Mr. Philemon",male,,0,0,345777,9.5,,S


In [34]:
titanic.iloc[4:10]

Unnamed: 0,survived,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
4,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


In [35]:
titanic.nlargest(10, 'age')

Unnamed: 0,survived,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
630,1,1,"Barkworth, Mr. Algernon Henry Wilson",male,80.0,0,0,27042,30.0,A23,S
851,0,3,"Svensson, Mr. Johan",male,74.0,0,0,347060,7.775,,S
96,0,1,"Goldschmidt, Mr. George B",male,71.0,0,0,PC 17754,34.6542,A5,C
493,0,1,"Artagaveytia, Mr. Ramon",male,71.0,0,0,PC 17609,49.5042,,C
116,0,3,"Connors, Mr. Patrick",male,70.5,0,0,370369,7.75,,Q
672,0,2,"Mitchell, Mr. Henry Michael",male,70.0,0,0,C.A. 24580,10.5,,S
745,0,1,"Crosby, Capt. Edward Gifford",male,70.0,1,1,WE/P 5735,71.0,B22,S
33,0,2,"Wheadon, Mr. Edward H",male,66.0,0,0,C.A. 24579,10.5,,S
54,0,1,"Ostby, Mr. Engelhart Cornelius",male,65.0,0,1,113509,61.9792,B30,C
280,0,3,"Duane, Mr. Frank",male,65.0,0,0,336439,7.75,,Q


In [36]:
titanic.nsmallest(10, 'age')

Unnamed: 0,survived,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
803,1,3,"Thomas, Master. Assad Alexander",male,0.42,0,1,2625,8.5167,,C
755,1,2,"Hamalainen, Master. Viljo",male,0.67,1,1,250649,14.5,,S
469,1,3,"Baclini, Miss. Helene Barbara",female,0.75,2,1,2666,19.2583,,C
644,1,3,"Baclini, Miss. Eugenie",female,0.75,2,1,2666,19.2583,,C
78,1,2,"Caldwell, Master. Alden Gates",male,0.83,0,2,248738,29.0,,S
831,1,2,"Richards, Master. George Sibley",male,0.83,1,1,29106,18.75,,S
305,1,1,"Allison, Master. Hudson Trevor",male,0.92,1,2,113781,151.55,C22 C26,S
164,0,3,"Panula, Master. Eino Viljami",male,1.0,4,1,3101295,39.6875,,S
172,1,3,"Johnson, Miss. Eleanor Ileen",female,1.0,1,1,347742,11.1333,,S
183,1,2,"Becker, Master. Richard F",male,1.0,2,1,230136,39.0,F4,S
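`nlargest` and `nsmallest` are shortcuts for sorting and taking the first rows. A small sketch on made-up data (the frame `people` is hypothetical), including a missing age like those in the Titanic table:

```python
import numpy as np
import pandas as pd

# Hypothetical passengers; one age is missing.
people = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "age": [30.0, np.nan, 71.0, 2.0],
})

oldest = people.nlargest(2, "age")
by_sort = people.sort_values("age", ascending=False).head(2)  # NaN sorts last by default

youngest = people.nsmallest(1, "age")

print(oldest)
print(youngest)
```

In this example both approaches return the same two rows, and the passenger with the missing age appears in neither result.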


In [37]:
gender = titanic[['survived', 'sex']]

In [38]:
gender[gender['survived'] == 0].groupby('sex').size()

sex
female     81
male      468
dtype: int64

In [39]:
# Select the rows where `survived` is 0 with a boolean mask; `.iloc` expects integer positions.
gender.loc[gender['survived'] == 0]

In [None]:
gender[gender['survived'] == 1].groupby('sex').size()
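Rather than filtering once for each outcome, both survival counts can be tabulated in one step. This is a sketch on invented rows with the same two columns as `gender`; `pd.crosstab` is my substitution here, not something the notebook uses.

```python
import pandas as pd

# Hypothetical rows with the same two columns as `gender`.
toy = pd.DataFrame({
    "survived": [0, 1, 1, 0, 0, 1],
    "sex": ["male", "female", "female", "male", "female", "male"],
})

# A boolean mask selects rows; .loc accepts it directly.
died = toy.loc[toy["survived"] == 0]
print(died.groupby("sex").size())

# Both outcomes at once: rows are sex, columns are the survived flag.
table = pd.crosstab(toy["sex"], toy["survived"])
print(table)
```

The cross-tabulation puts the two `groupby` results side by side, which makes the difference between the groups easier to read.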

### Rock Songs

In [None]:
rockin = pd.read_csv('data/eda_data/rocking.csv', index_col = 0)

In [None]:
rockin.info()

In [None]:
rockin.head()

In [None]:
#rockin = rockin.rename({'First?': 'First', 'Year?': 'Year', 'F*G': 'fg'}, axis = 1)

In [None]:
#null_release_mask = rockin['Release Year'].isnull()
#rockin.loc[null_release_mask, 'Release Year'] = 0

In [None]:
#rockin.head()

In [None]:
#rockin['ARTIST CLEAN'].unique()[::10]
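The commented-out cells above outline two cleaning steps: renaming awkward column names and filling missing release years. The sketch below runs those steps on a made-up frame. The rows are hypothetical, and only the column names are taken from the comments. Using `0` as the placeholder follows the commented code; whether that is a sensible sentinel depends on the later analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical rows using the column names that appear in the commented-out cells.
songs = pd.DataFrame({
    "Release Year": [1975.0, np.nan, 1991.0],
    "First?": [1, 1, 0],
    "Year?": [1, 0, 1],
    "F*G": [1, 0, 0],
})

# columns= is equivalent to passing the mapping with axis=1.
songs = songs.rename(columns={"First?": "First", "Year?": "Year", "F*G": "fg"})

# Flag missing release years, then overwrite only those rows.
null_release_mask = songs["Release Year"].isnull()
songs.loc[null_release_mask, "Release Year"] = 0

print(songs)
```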