# Very Short Pandas Tutorial

In [1]:
# Load library
import numpy as np
import pandas as pd

## Adding a new Column to a DataFrame

In [2]:
# Create empty dataframe
df = pd.DataFrame()

# Create a column
df['name'] = ['Alice', 'Bob', 'Charlie']

# View dataframe
df

      name
0    Alice
1      Bob
2  Charlie


In [3]:
# Assign a new column to df called 'age' with a list of ages
# Make sure the new column has the same number of values as there are rows in the original!
df.assign(age = [16, 17, 18])

      name  age
0    Alice   16
1      Bob   17
2  Charlie   18


In [4]:
# View dataframe
df

      name
0    Alice
1      Bob
2  Charlie


In [5]:
# Be sure to store the result!
df = df.assign(age = [16, 17, 18])
df

      name  age
0    Alice   16
1      Bob   17
2  Charlie   18
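The difference between `assign()` and plain column assignment can be sketched as follows. This is a minimal example on the same three names; the variable names `with_age` and `columns_before` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie']})

# assign() returns a new DataFrame and leaves df untouched
with_age = df.assign(age=[16, 17, 18])
columns_before = list(df.columns)   # remember what df looked like
print(columns_before)
print(list(with_age.columns))

# Direct assignment, in contrast, modifies df itself
df['age'] = [16, 17, 18]
print(list(df.columns))
```

Which of the two you use is mostly a matter of taste: `assign()` is convenient in method chains, direct assignment is shorter.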


## Accessing Columns

A DataFrame is made up of rows and columns. Each column is exposed as a pd.Series, which for numeric data is typically backed by a NumPy array (great knowledge for Trivial Pursuit). You get columns out of a DataFrame the same way you would normally get elements out of a dictionary.

In [6]:
df['age']

0    16
1    17
2    18
Name: age, dtype: int64

We can also select multiple columns at once; the columns are returned in the order you request them.

In [7]:
df[['age', 'name']]

   age     name
0   16    Alice
1   17      Bob
2   18  Charlie
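One detail that is easy to miss: a single label gives you a Series, while a list of labels gives you a DataFrame, even if the list holds only one name. A small sketch (the names `single` and `double` are just for illustration):

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie'], 'age': [16, 17, 18]})

single = df['age']      # one label -> Series
double = df[['age']]    # list of labels -> DataFrame with one column

print(type(single).__name__)
print(type(double).__name__)
```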


## Accessing Rows
You would probably expect to access rows just like you normally would in NumPy arrays or Python lists.
Here things are ever so slightly different: df[1] looks for a column labelled 1, so you just need to know the right syntax.

In [8]:
df[1]

KeyError: 1

In [9]:
# To get the row at a given position we use .iloc
print(df.iloc[1])

print()
# Slicing works too, although it returns a DataFrame instead of a Series.
print(df[1:2])

name    Bob
age      17
Name: 1, dtype: object

  name  age
1  Bob   17
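Next to `.iloc` (position based) there is also `.loc` (label based). The two only differ visibly once the index is something other than 0, 1, 2, so the sketch below uses made-up labels 'a', 'b', 'c', which are not part of the example above.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie'], 'age': [16, 17, 18]},
                  index=['a', 'b', 'c'])  # hypothetical row labels

by_position = df.iloc[1]     # second row, whatever its label is
by_label = df.loc['b']       # row whose index label is 'b'
as_frame = df.iloc[1:2]      # a slice keeps the DataFrame type

print(by_position)
print(by_label)
print(as_frame)
```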


## Applying operators to a column in a dataframe

In [None]:
data = {'name': ['Alice', 'bob', 'Charlie', 'Dennis', 'Eric'], 
        'year': [2012, 2012, 2013, 2014, 2014], 
        'reports': [4, 24, 31, 2, 3],
        'coverage': [25, 94, 57, 62, 70]}

# Do note that the index is not part of the data! The labels merely replace the row numbers that we saw in previous examples.
df = pd.DataFrame(data, index = ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df

In [None]:
capitalizer = lambda x: x.upper()

# On a Series, apply() calls the function on every element
df['name'].apply(capitalizer)

In [None]:
# As said, don't forget to store the result, or you do a lot of work without any gain.
df

In [None]:
df['reports'] = df['reports'].apply(np.sqrt)
df
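For common operations you often do not need `apply()` at all. As a sketch, assuming a small made-up frame with a 'name' and a 'reports' column, the string accessor `.str` and NumPy functions work on a whole column at once and give the same result:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'bob', 'Charlie'],
                   'reports': [4, 25, 36]})  # made-up values

# Element-wise with apply() ...
upper_apply = df['name'].apply(lambda x: x.upper())
sqrt_apply = df['reports'].apply(np.sqrt)

# ... and the vectorized equivalents
upper_str = df['name'].str.upper()
sqrt_direct = np.sqrt(df['reports'])

print(upper_str)
print(sqrt_direct)
```

On large columns the vectorized form is usually the faster of the two.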

### Grabbing some easy numbers

In [None]:
df.describe()

# More info (scroll down for examples)
# https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html
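By default `describe()` only summarizes the numeric columns. A small sketch on made-up data shows the difference with `include='all'`, which also reports on text columns:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie'],
                   'reports': [4, 24, 31]})  # made-up values

numeric_only = df.describe()             # only 'reports' shows up
everything = df.describe(include='all')  # 'name' is summarized as well

print(numeric_only)
print(everything)
```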

## Grouping and applying operators to them

In [None]:
# Create dataframe
raw_data = {'city': ['Amsterdam', 'Amsterdam', 'Amsterdam', 'Amsterdam', 'Utrecht', 'Utrecht', 'Utrecht', 'Utrecht', 'Den Haag', 'Den Haag', 'Den Haag', 'Den Haag'], 
        'team': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'], 
        'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'], 
        'preGoals': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
        'postGoals': [25, 32, 49, 20, 15, 8, 35, 34, 17, 24, 21, 33]}
df = pd.DataFrame(raw_data, columns = ['city', 'team', 'name', 'preGoals', 'postGoals'])
df

In [None]:
# Create a groupby object that groups preGoals by city
groupby_city = df['preGoals'].groupby(df['city'])

# As you can see, it does not compute anything yet; we merely defined what is to be considered a group.
groupby_city

So now we have a group... But what can we do with it? To start things off, we can ask for more details.

In [None]:
groupby_city.describe()

We can also ask for specific statistics, and even group them further.

In [None]:
groupby_city.mean()
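If you want a handful of statistics rather than one, `agg()` accepts a list of them. A minimal sketch on a shortened, made-up version of the goals table:

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['Amsterdam', 'Amsterdam', 'Utrecht', 'Utrecht'],
    'preGoals': [4, 24, 3, 5],  # made-up values
})

# One row per city, one column per requested statistic
stats = df.groupby('city')['preGoals'].agg(['mean', 'min', 'max'])
print(stats)
```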

In [None]:
# Group by city, then by team, and then give the mean.
df['preGoals'].groupby([df['city'], df['team']]).mean()

We can also display the above with the cities on the rows, and the teams on the columns:

In [None]:
df['preGoals'].groupby([df['city'], df['team']]).mean().unstack()

Or we just apply this idea to the original dataframe

In [None]:
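# numeric_only=True skips the text column 'name'; recent pandas versions raise an error without it
df.groupby(['city', 'team']).mean(numeric_only=True)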

We can count occurrences as well, per combination that we define.

In [None]:
df.groupby(['city', 'team']).size()

Although you can do plenty more, as a final example we show that you can also iterate over the groups.

In [None]:
# Iterate over the data of the dataframe, grouped by city.
for name, group in df.groupby('city'): 
    # print the name of the city
    print(name)
    # print the data belonging to that city
    print(group)
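Iterating hands you each group as an ordinary DataFrame, so anything you can do with a DataFrame you can do per group. A sketch on made-up data, which also shows `get_group()` for picking out a single group without a loop:

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['Amsterdam', 'Amsterdam', 'Utrecht'],
    'preGoals': [4, 24, 3],  # made-up values
})

grouped = df.groupby('city')

# Collect one number per city while iterating
totals = {name: group['preGoals'].sum() for name, group in grouped}
print(totals)

# Fetch just one group directly
amsterdam = grouped.get_group('Amsterdam')
print(amsterdam)
```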

## Removing entries with NaN/empty values

In [None]:
# Create feature matrix
X = np.array([[1, 2], 
              [6, 3], 
              [8, 4], 
              [9, 5], 
              [np.nan, 4]])

# Load data as a data frame
df = pd.DataFrame(X, columns=['feature_1', 'feature_2'])

# Remove observations (rows) with missing values. Rows are the default;
# to drop columns instead, pass axis=1 to dropna().
df.dropna()
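The `axis` argument of `dropna()` decides whether rows or columns are thrown away. A minimal sketch on the same small matrix makes the difference visible:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'feature_1': [1, 6, 8, 9, np.nan],
                   'feature_2': [2, 3, 4, 5, 4]})

rows_dropped = df.dropna()          # default axis=0: drops the last row
cols_dropped = df.dropna(axis=1)    # drops the whole 'feature_1' column

print(rows_dropped.shape)
print(cols_dropped.shape)
```

Neither call changes `df` itself, so store the result if you want to keep it.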

## Loading a dataset


In [None]:
# By default, read_csv assumes the delimiter to be a comma.
# In case you have something else, like a semicolon, pass the following parameter:
# sep=';'.
complaints = pd.read_csv("311-service-requests.csv", sep=',')

complaints[:5]
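To try out the `sep` parameter without a file on disk, you can read from a string. The sketch below uses made-up semicolon-separated text wrapped in `io.StringIO`, which `read_csv` accepts in place of a file name:

```python
import io
import pandas as pd

# Made-up data, separated by semicolons instead of commas
raw = "name;age\nAlice;16\nBob;17\n"

people = pd.read_csv(io.StringIO(raw), sep=';')
print(people)
```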

## Value counts
At times it might be interesting to know what we actually complain about the most. Don't worry, Pandas has got you covered.

In [None]:
complaints['Complaint Type'].value_counts()[:5]
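`value_counts()` can also report fractions instead of absolute counts via `normalize=True`. A small sketch on a made-up Series:

```python
import pandas as pd

s = pd.Series(['Noise', 'Noise', 'Rodent', 'Noise'])  # made-up values

counts = s.value_counts()                 # absolute counts, most frequent first
shares = s.value_counts(normalize=True)   # same, as fractions of the total

print(counts)
print(shares)
```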

## Finding specific entries (Masking)
At times you want to find entries that meet specific requirements. For instance, look at the first entry: let's find all entries with that same specific noise complaint, which is labeled as
    
    Noise - Street/Sidewalk
    


Now let us take a closer look as to how we would approach this.

In [None]:
complaints['Complaint Type'] == "Noise - Street/Sidewalk"

In [None]:
complaints[complaints['Complaint Type'] == "Noise - Street/Sidewalk"]

### Alternatively we can also try looking for all possible Noise complaints.


In [None]:
is_noise = complaints['Complaint Type'].str.contains("Noise")
is_noise

### Let's use this mask.

In [None]:
noise_complaints = complaints[is_noise]
noise_complaints[:3]
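One caveat, which we have not checked for this dataset: if a text column contains missing values, `str.contains()` yields a missing value for those rows rather than True or False, which generally makes the mask unusable for row selection. Passing `na=False` treats them as "no match". A sketch on a made-up Series:

```python
import pandas as pd

s = pd.Series(['Noise - Park', None, 'Rodent'])  # made-up values, one missing

# na=False: rows without a value simply do not match
mask = s.str.contains('Noise', na=False)
print(mask)
print(s[mask])
```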

### Let's unleash some Pandas to find all the possible Noise complaints.

In [None]:
complaints[complaints['Complaint Type'].str.contains("Noise")]['Complaint Type'].unique()

So, we see that the comparison creates a boolean (True/False) Series, which states which rows should or should not be returned. But there's more! You can also combine multiple of these conditions at once using logical operators: & (and), | (or) and ~ (not).

In [None]:
is_noise = complaints['Complaint Type'] == "Noise - Street/Sidewalk"
in_brooklyn = complaints['Borough'] == "BROOKLYN"
by_agency = complaints['Agency'] == "NYPD"

# Return all noise complaints that are not in Brooklyn and are handled by the NYPD
complaints[is_noise & ~in_brooklyn & by_agency][:5]

And in case you do not want all of the 52 columns:

In [None]:
complaints[is_noise & in_brooklyn][['Complaint Type', 'Borough', 'Created Date', 'Descriptor']][:10]
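If you do not have the CSV at hand, the same masking ideas can be tried on a tiny made-up table. The sketch below also shows two things worth knowing: comparisons written inline need parentheses around them, and `.loc` can select rows and columns in a single step.

```python
import pandas as pd

# Made-up stand-in for the complaints data
toy = pd.DataFrame({
    'Complaint Type': ['Noise - Street/Sidewalk', 'Illegal Parking',
                       'Noise - Commercial', 'Noise - Street/Sidewalk'],
    'Borough': ['BROOKLYN', 'BROOKLYN', 'MANHATTAN', 'QUEENS'],
})

is_noise = toy['Complaint Type'].str.contains('Noise')
in_brooklyn = toy['Borough'] == 'BROOKLYN'

noise_outside_brooklyn = toy[is_noise & ~in_brooklyn]   # and + not
noise_or_brooklyn = toy[is_noise | in_brooklyn]         # or

# Inline conditions: each comparison goes in its own parentheses
parking_in_brooklyn = toy[(toy['Borough'] == 'BROOKLYN')
                          & (toy['Complaint Type'] == 'Illegal Parking')]

# Rows and columns in one go
noise_boroughs = toy.loc[is_noise, ['Borough']]
print(noise_boroughs)
```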

## So... now the question remains, who is the best at making some noise?

In [None]:
is_noise = complaints['Complaint Type'].str.contains("Noise")

noise_complaints = complaints[is_noise]
noise_complaints['Borough'].value_counts()



### Is the above solution valid, if we say that Manhattan must be the noisiest?

In [None]:
# Answer pending from you guys (or me, but preferably you).

In [None]:
complaints['Borough'].value_counts()

In [None]:
total_complaints = complaints['Borough'].value_counts()
noise_complaints['Borough'].value_counts() / total_complaints
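Why the division matters is easiest to see on a small made-up example: the borough with the most noise complaints in absolute numbers is not necessarily the one where noise makes up the largest share. The division lines the two Series up by borough name automatically.

```python
import pandas as pd

# Made-up data: Brooklyn has more noise complaints, but far more complaints overall
toy = pd.DataFrame({
    'Borough': ['BROOKLYN'] * 5 + ['MANHATTAN'] * 2,
    'Complaint Type': ['Noise - Park', 'Noise - Park', 'Rodent', 'Rodent', 'Rodent',
                       'Noise - Park', 'Rodent'],
})

is_noise = toy['Complaint Type'].str.contains('Noise')

noise_counts = toy[is_noise]['Borough'].value_counts()
total_counts = toy['Borough'].value_counts()

# Share of complaints that are about noise, per borough, highest first
noise_share = (noise_counts / total_counts).sort_values(ascending=False)
print(noise_counts)
print(noise_share)
```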

#### Further examples
Most examples have been taken from https://github.com/jvns/pandas-cookbook, which contains a lot more than discussed here.
Time is limited during the lecture, so we limited it to the above.