# Numpy and Pandas

The numpy package brings R-style vectors to Python. Numpy arrays are very efficient sequences whose elements all share the same type, and they are optimised for batch processing. Numpy is useful for data processing in general, but it matters here in particular because numpy arrays are used to store the columns in the *pandas* package.

The *pandas* package is Python's analogue of R's *dataframe*. A pandas dataframe consists of a collection of columns, where each column holds its values in a numpy array. The interface for pandas is very similar to the interface for R's dataframes, and it is an excellent tool for processing tabular data in Python.

## Numpy

A numpy array is similar to a list. However, there are a few key differences. Find out how numpy handles addition of arrays. How does this compare to the list implementation?
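
As a hint at the kind of difference to look for, here is a minimal sketch using multiplication rather than addition. The variable names are made up for illustration.

```python
import numpy as np

values = [1, 2, 3]
arr = np.array(values)

# Multiplying a list by an integer repeats the whole list
repeated = values * 2

# Multiplying an array by a number scales every element
scaled = arr * 2

print(repeated)
print(scaled)
```

The list operator works on the container as a whole, while the array operator is applied element by element.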

### Generating Sequences

We can generate sequences of numbers using the **linspace** and **arange** functions. **arange** is similar to R's *seq()* function. It generates a sequence of numbers from a minimum up to (but not including) a maximum, going up in a given step. The **linspace** function is similar, but instead of a step it takes the number of points we want, evenly spaced between a minimum and a maximum (both included by default).
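
The difference between the two functions can be sketched on a small range. The ranges below are arbitrary and chosen only for illustration.

```python
import numpy as np

# arange(start, stop, step): the stop value itself is not included
steps = np.arange(0, 10, 2)

# linspace(start, stop, num): num points, both endpoints included by default
points = np.linspace(0, 1, 5)

print(steps)
print(points)
```

With arange we decide the gap between the numbers; with linspace we decide how many numbers we get back.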

Generate the sequence (10, 20, ..., 90, 100) twice: once using arange and once using linspace.

### Numpy Array Shapes

When we think about arrays we tend to think of them in terms of 1-dimensional lists. Any item can be located in a 1-dimensional list using a single index. A 2-dimensional array is also known as a matrix. Matrices consist of rows and columns, so to locate an item in a 2-dimensional array you need 2 indices, one for the row and one for the column. Dataframes are similar to (but not the same as) 2-dimensional arrays (what makes them different?).

You can check a numpy array's dimensionality using the shape property.

In [7]:
import numpy as np

oneDValues = [1, 2, 3, 4, 5, 6, 7, 8, 9]
twoDValues = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

a3 = np.array(oneDValues)
a4 = np.array(twoDValues)
print("a3 has shape " + str(a3.shape))
print("a4 has shape " + str(a4.shape))

print("2D Array: " + str(twoDValues))
print("Element at (0,0) is " + str(a4[0, 0]))
print("Element at (1,1) is " + str(a4[1, 1]))
print("Element at (2,1) is " + str(a4[2, 1]))


a3 has shape (9,)
a4 has shape (3, 3)
2D Array: [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
Element at (0,0) is 1
Element at (1,1) is 5
Element at (2,1) is 8


### Slicing numpy arrays

Numpy arrays can be sliced using a similar syntax to regular list slicing. Remember that lists were sliced using \[start_index:end_index\]. We can also provide an additional **step** parameter to choose the step by which we increase the index: \[start_index:end_index:step\]. For 2-dimensional arrays we give one slice per dimension, separated by a comma: \[row_slice, column_slice\].
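
A minimal sketch of the slicing syntax, using made-up arrays (`letters` and `grid` are not part of the notebook's data):

```python
import numpy as np

letters = np.array(["a", "b", "c", "d", "e", "f", "g"])

# start at index 0, stop before index 7, move 3 positions at a time
every_third = letters[0:7:3]

# a negative step walks through the array backwards
backwards = letters[::-1]

grid = np.array([[1, 2, 3, 4],
                 [5, 6, 7, 8]])

# a bare colon means "everything along this dimension"
first_row = grid[0, :]
middle_columns = grid[:, 1:3]

print(every_third)
print(first_row)
print(middle_columns)
```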

Take every second item from the a3 array (the array built from oneDValues).

In [8]:
# second row, third column of a4


# all rows, first column of a4


# all rows, last column of a4


## Pandas

Pandas (short for *panel data*) is a library in Python which makes working with dataframes easier. It's quite similar to the R implementation of data frames so you should already be familiar with the basic concepts (even if the syntax is a little different).

Pandas represents a table as a **DataFrame**.

Each column in a DataFrame is a Pandas **Series**, which usually stores its values in a **numpy array**.

A single row selected from a DataFrame is also returned as a Pandas **Series**.

### Pandas Series

Pandas uses the *Series* data structure to represent a row. A Series is a list of values where each value can be accessed either by name (the name of the column it belongs to) or by index (the position of the column it belongs to). In order to create a Series we need to supply both the series *values* and the *labels*. We can keep our labels and values in Python lists (or numpy arrays) and pass them as parameters to the **pd.Series** constructor, or we can use a Python dictionary, which can be converted directly into a Series.

In [11]:
import pandas as pd
import numpy as np

labels = ['a','b', 'c']
values = np.array([10,20, 30])

s1 = pd.Series(data=values, index=labels)
print(s1)

rowDict = dict({'a': 10, 'b': 20, 'c': 30})
s2 = pd.Series(rowDict)
print(s2)


a    10
b    20
c    30
dtype: int32
a    10
b    20
c    30
dtype: int64


We have three ways of accessing a column value in a series. 

1. Using the column name with dot notation
2. Using the column names with square bracket notation
3. Using the column index with square bracket notation (in recent versions of pandas, go through `.iloc` for this)
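
The three access styles listed above can be sketched on a made-up Series (`reading` is a hypothetical row, not the s1 or s2 series from the notebook). Positional access goes through `.iloc` here, on the assumption that a recent pandas version is installed; plain square brackets with an integer are deprecated for label-indexed Series.

```python
import pandas as pd

# Hypothetical row of data, used only for illustration
reading = pd.Series({"city": "Galway", "temp": 11.5, "wind": 30})

by_dot = reading.temp            # 1. dot notation
by_name = reading["temp"]        # 2. square brackets with the label
by_position = reading.iloc[1]    # 3. square brackets with the position, via .iloc

print(by_dot, by_name, by_position)
```

Dot notation only works when the label is a valid Python name and does not clash with an existing Series method, so square brackets are the safer habit.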

In [9]:
#1 dot notation


#2 square bracket name notation


#3 square bracket column index


### Creating a Pandas DataFrame

The easiest way to create a Pandas DataFrame is to use a list of Python dictionaries. We've seen above that we can easily create a row (or *Series*) from a Python dictionary. The dictionary key is taken to represent the column name while the value represents the value for that row. By creating a list of dictionaries we can gather together our initial data for creating a dataframe.

In [8]:
students = list()
jack = dict({"student_number": "d12345678", "year": 1, "name": "Jack"})

jill = {}
jill["student_number"] = "d87654321"
jill["year"] = 2
jill["name"] = "Jill"

students.append(jack)
students.append(jill)
pd.DataFrame(students)

# We could shorten the above code to the following
students = [{"student_number": "d12345678", "year": 1, "name": "Jack"}, {"student_number": "d87654321", "year": 2, "name": "Jill"}]
pd.DataFrame(students)

  student_number  year  name
0      d12345678     1  Jack
1      d87654321     2  Jill


We can also create a dataframe directly from a CSV file. As with the read.csv and read_csv functions in R, there are lots of parameters we can tweak to suit our data file. You can find a full list by looking up the [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html).
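
As a rough sketch of what tweaking those parameters looks like, the example below reads a tiny made-up file held in memory. The data and the name `small` are assumptions for illustration; `sep`, `na_values` and `usecols` are three of the documented read_csv parameters.

```python
import io
import pandas as pd

# A tiny in-memory stand-in for a file on disk (hypothetical data)
raw = io.StringIO(
    "id;name;score\n"
    "1;Ann;7.5\n"
    "2;Bob;missing\n"
)

small = pd.read_csv(
    raw,
    sep=";",                    # the file uses semicolons instead of commas
    na_values=["missing"],      # extra strings to treat as NaN
    usecols=["name", "score"],  # only load these columns
)
print(small)
```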

In [12]:
df = pd.read_csv('titanic_train.csv')
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


## Basic DataFrame Inspection

We can view the top and bottom of our dataframe using the **head()** and **tail()** methods. Both return 5 rows by default and accept a different number of rows as the parameter **n**. We can check the number of rows and columns using the dataframe attribute **shape**.

In [14]:
# check the first 5 rows using head()
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [18]:
# check the last 3 rows using tail()
df.tail(n=3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


In [19]:
# check the count of rows and columns using shape
df.shape

(891, 12)

### Indexing and Slicing Columns

Columns can be accessed using *dot notation* or *square bracket notation*. If we want to access multiple columns we have to use square bracket notation, passing a list of column names.
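
A minimal sketch of both notations on a made-up `pets` table (hypothetical data, not the Titanic dataframe):

```python
import pandas as pd

# Hypothetical table, used only for illustration
pets = pd.DataFrame({
    "name": ["Rex", "Tom"],
    "species": ["dog", "cat"],
    "age": [3, 5],
})

ages_dot = pets.age                # dot notation gives a Series
ages_brackets = pets["age"]        # square brackets give the same Series
subset = pets[["name", "age"]]     # a list of names gives a DataFrame

print(type(ages_brackets).__name__, type(subset).__name__)
```

Note the double square brackets in the last line: the outer pair does the indexing and the inner pair is an ordinary Python list.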

In [13]:
# get the Survived column using dot notation


In [14]:
# get the Survived column using Square-bracket notation


In [15]:
# Get the PassengerId and Survived columns using square-bracket notation


### Indexing and Slicing Rows

We can select rows from a pandas dataframe using the **.loc** and **.iloc** properties. loc allows us to select rows and columns by label (name), whereas iloc allows us to select rows and columns by integer position. Usually rows have a numeric index (though it is possible to give rows a named index). If the index is numeric (as is the default), the row labels are numbers, so we can still select rows by number with *loc*. Note that a slice in loc includes its end label, whereas a slice in iloc excludes its end position.
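
The difference between the two can be sketched on a made-up `pets` table (hypothetical data, not the Titanic dataframe):

```python
import pandas as pd

# Hypothetical table with the default numeric index 0..3
pets = pd.DataFrame({
    "name": ["Rex", "Tom", "Kiki", "Bob"],
    "species": ["dog", "cat", "bird", "fish"],
    "age": [3, 5, 1, 2],
})

# .loc works with labels; a label slice includes its end point
by_label = pets.loc[0:2, ["name", "age"]]

# .iloc works with integer positions; the end point is excluded
by_position = pets.iloc[0:2, [0, 2]]

print(by_label)
print(by_position)
```

The same `0:2` slice returns three rows with loc and two rows with iloc, which is an easy source of off-by-one mistakes.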

In [16]:
# Get the first five rows with .loc, selecting the Name, Age and Survived columns


In [17]:
# Get the first five rows with .iloc, selecting the Name, Age and Survived columns


### Accessing rows with boolean logic

We've seen that we can access a row if we know its position within the dataframe. We can also find a row by value if we set the index. However, it's much more common that we want to select all rows meeting a certain condition (e.g. all rows where the label is *True*). The .loc indexer also lets us pass an array of True/False values, and will return only rows corresponding to *True* in the array.

In [18]:
# use logical indexing to select the first and third rows only
students = pd.DataFrame([['d19122334', 'John', 'Smith'], ['d19155667', 'Jane', 'Doe'], ['c18155334', 'Enda', 'Smith']], columns=["StudentNo", "FirstName", "LastName"])
print(students.loc[[True, False, True]])

   StudentNo FirstName LastName
0  d19122334      John    Smith
2  c18155334      Enda    Smith


This is a very long-winded way of selecting the first and last rows, but it allows us to create complex queries to retrieve certain rows in our dataset.

In [None]:
isSmith = students["LastName"] == "Smith"
# Gives us [True, False, True]
print(isSmith)

print(students.loc[isSmith])

# Putting all of this together on one line we get
students.loc[students["LastName"] == "Smith"]

This format may seem long-winded and difficult at first, but you'll quickly get used to it and it will become second nature. It's quite a common way of selecting data in Pandas. The example above was somewhat simple, but you can build up complex queries using the element-wise *logical-and* (`&`) and *logical-or* (`|`) operators; Python's `and` and `or` keywords do not work on whole columns. Watch out when using them in Pandas: `&` and `|` bind more tightly than comparisons such as `==`, so you must put each comparison inside round brackets for it to work correctly (see below).
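
A minimal sketch of combining conditions, using a made-up `pets` table so the exercises below stay open. It uses the element-wise operators `&` (and), `|` (or) and `~` (not), with every comparison wrapped in round brackets.

```python
import pandas as pd

# Hypothetical table, used only for illustration
pets = pd.DataFrame({
    "name": ["Rex", "Tom", "Kiki", "Bob"],
    "species": ["dog", "cat", "bird", "fish"],
    "age": [3, 5, 1, 2],
})

# rows where EITHER condition holds
young_or_cat = pets.loc[(pets["age"] < 3) | (pets["species"] == "cat")]

# rows where BOTH conditions hold
old_dog = pets.loc[(pets["age"] >= 3) & (pets["species"] == "dog")]

# ~ flips every True/False value in the mask
not_dog = pets.loc[~(pets["species"] == "dog")]

print(young_or_cat)
print(old_dog)
```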

In [18]:
# get all students whose last name is Doe or firstName is John


# get all students whose last name is Smith and first name is Enda


# get all rows in the titanic dataset corresponding to children


## Pandas query syntax

As you may have noticed, a basic pandas query can result in quite a few keystrokes, particularly when the criterion is anything other than a single simple operation. Pandas provides a special method, **query()**, which allows you to use a shorter syntax for any of these expressions. This allows me, for example, to write the students' name query much more succinctly, so that

`students.loc[(students["LastName"] == "Smith") & (students["FirstName"] == "Enda")]`

becomes

In [21]:
students.query('(LastName == "Smith") & (FirstName == "Enda")')

   StudentNo FirstName LastName
2  c18155334      Enda    Smith


The query() method doesn't add any additional functionality to Pandas; behind the scenes it evaluates the string you give it as the equivalent filtering expression. Using it is totally optional, but it does cut down the number of keystrokes needed. A couple of things to bear in mind:
* Any string values should be surrounded in quotes (different from the ones around the whole query string) so pandas doesn't think it's a column name
* Round brackets around each comparison are optional inside query(), because `&` and `|` are given the same precedence as `and` and `or` there; keeping the brackets does no harm
* Column names containing spaces need to be quoted using backtick \` \` symbols
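
The points above can be sketched on a made-up `pets` table. The table, including its deliberately awkward `age in years` column, is hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical table with a space in one of the column names
pets = pd.DataFrame({
    "name": ["Rex", "Tom", "Kiki"],
    "species": ["dog", "cat", "bird"],
    "age in years": [3, 5, 1],
})

# string value in double quotes, column with spaces in backticks
result = pets.query('species == "dog" or `age in years` > 4')

print(result)
```

The whole expression is wrapped in single quotes, which leaves double quotes free for the string value inside it.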

## Setting values using .loc

As well as just reading values from a dataframe, we can use .loc to set values based on a query. Whatever value you assign will be written to every cell returned by the query. This makes it easy to apply batch updates but we need to be very careful that we don't accidentally overwrite an entire row, column, or even dataframe with an incorrect value.
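
A minimal sketch of both the intended use and the accident to watch out for, using a made-up `pets` table (hypothetical data, with text columns only to keep the example simple):

```python
import pandas as pd

# Hypothetical table, used only for illustration
pets = pd.DataFrame({
    "name": ["Rex", "Tom", "Kiki"],
    "species": ["dog", "cat", "bird"],
    "owner": ["Ann", "Bob", "Cat"],
})

# Row condition AND column label: only one cell per matching row changes
pets.loc[pets["species"] == "cat", "name"] = "Felix"

# Row condition only: every column in the matching rows is overwritten
careless = pets.copy()
careless.loc[careless["species"] == "bird"] = "unknown"

print(pets)
print(careless)
```

Leaving out the column label is all it takes to wipe a whole row, so it is worth checking the selection by reading it before assigning to it.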

In [20]:
# Set Enda Smith's Last name to Samson


In [21]:
# Set all columns in Enda Smith's row to Samson


We can also reference other columns when setting values, just like we did in *R*. The example below creates a new column (TempFahrenheit) by converting the values of the TempCelsius column.

In [24]:
tempReadings = pd.DataFrame([['Westport', 3.0], ['Wexford', 8.0]], columns=['Location', 'TempCelsius'])
tempReadings['TempFahrenheit'] = (tempReadings['TempCelsius'] * 1.8) + 32
tempReadings

   Location  TempCelsius  TempFahrenheit
0  Westport          3.0            37.4
1   Wexford          8.0            46.4


## Dropping Rows or Columns

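We can drop rows or columns using the **drop()** method. If we're dropping a column we need to set the axis parameter to tell pandas we're dropping a column and not a row. In general axis 0 is rows and axis 1 is columns. Note that drop() returns a new dataframe and leaves the original untouched, which is why the results below are assigned to new variables.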

In [25]:
df2 = df.drop("Ticket", axis=1)
df2

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,30.0000,C148,C


In [26]:
# We can also drop a row by index
df3 = df.drop(0, axis=0)
df3

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


## Calculating Summary Statistics

We can calculate summary statistics on a column by selecting that column and using a summary function.

In [27]:
# Get the most common value for the Sex column
df["Sex"].mode()

0    male
dtype: object

In [28]:
# Get the total amount of money spent on fares, rounded to 2 decimal places
df["Fare"].sum().round(2)

28693.95

In [29]:
# Get the standard deviation of age, rounded to 2 decimal places
df["Age"].std().round(2)

14.53

In [30]:
# Get the age of the youngest passenger on board
df["Age"].min()

0.42

In [31]:
# Get the age of the oldest passenger on board
df["Age"].max()

80.0

Many of the common functions can be applied directly to the dataframe or column. Some summary functions, though, live in numpy and are called as ordinary functions rather than as methods. The numpy percentile function is an example; notice that we pass the column in as the first argument. (Pandas has its own equivalent, the **quantile()** method, which takes a fraction between 0 and 1 instead of a percentage.)

In [32]:
np.percentile(df["Fare"], 75)

31.0

Watch out for missing values. The NaN (**Not a Number**) values in the dataset cause np.percentile to return NaN for any column that contains them. We need to make sure we handle NaNs ourselves.

In [33]:
print(np.percentile(df["Age"], 75))
print("Untreated 40th Percentile Age: " + str(np.percentile(df["Age"], 40)))

# Replacing NAs with 0 is often not a good idea
print("After replacing NaN with 0: " + str(np.percentile(df["Age"].fillna(0), 40)))

print("After replacing NaN with avg: " + str(np.percentile(df["Age"].fillna(df["Age"].mean()), 40)))

nan
Untreated 40th Percentile Age: nan
After replacing NaN with 0: 20.5
After replacing NaN with avg: 28.0
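
Filling in the gaps is not the only option. As a rough sketch on a made-up `ages` Series (hypothetical values, not the Titanic data), we can also leave the missing values out of the calculation altogether:

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing values
ages = pd.Series([22.0, np.nan, 35.0, 54.0, np.nan, 4.0])

plain = np.percentile(ages.to_numpy(), 50)         # NaN poisons the result
dropped = np.percentile(ages.dropna(), 50)         # remove the NaNs first
nan_aware = np.nanpercentile(ages.to_numpy(), 50)  # numpy's NaN-skipping variant
pandas_way = ages.quantile(0.5)                    # pandas skips NaN by default

print(plain, dropped, nan_aware, pandas_way)
```

The last three lines agree with each other because they all work from the four known ages only. Whether skipping or filling is the better choice depends on the dataset and the question being asked.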


If we use a summary function on a dataframe it will apply the function to each relevant column. I'm using relevant in the broadest sense here; it's not exactly relevant to calculate the mean of PassengerId. PassengerId is a numeric column, though, so pandas will go ahead and calculate it for us. In recent versions of pandas we need to pass numeric_only=True so that text columns such as Name are skipped rather than causing an error.

In [34]:
df.mean(numeric_only=True)

PassengerId    446.000000
Survived         0.383838
Pclass           2.308642
Age             29.699118
SibSp            0.523008
Parch            0.381594
Fare            32.204208
dtype: float64