### DAT Class 5

### Section I:  Working With Numpy

In [1]:
# import the library
import numpy as np

In [2]:
# numpy's core data structure is the n-dimensional array; here we build a 4 x 4 (two dimensional) one
a = np.arange(16).reshape(4, 4)
a

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

The above data structure would be tedious to create in plain Python, and difficult to manipulate.  But in Numpy it's very natural, and arrays like this are the usual way to represent tabular, 2-dimensional data such as csv files, excel sheets, and sql tables.
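To make the comparison concrete, here is a minimal sketch that builds the same 4 x 4 grid with plain nested lists, averages one column by hand, and then does the same with Numpy. The variable names (`rows`, `col_1_mean_py`, `col_1_mean_np`) are illustrative only.

```python
import numpy as np

# plain Python: a list of lists, built row by row
rows = [[r * 4 + c for c in range(4)] for r in range(4)]

# averaging the second column means looping over every row by hand
col_1_mean_py = sum(row[1] for row in rows) / len(rows)

# numpy: the same grid, and the same column mean, in one expression each
a = np.arange(16).reshape(4, 4)
col_1_mean_np = a[:, 1].mean()

print(col_1_mean_py, col_1_mean_np)
```

Both approaches give the same number; the difference is how much bookkeeping you have to write yourself.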

In [3]:
# aggregation methods such as mean() operate on every value in the array at once
a.mean()

7.5

In [4]:
# likewise, the same methods work on any selected portion of the array, such as the first row
a[0].mean()

1.5

The above value is the average of all the values in the first row of the array 'a'.

You can also use slice notation to access both the rows and columns within a numpy array.

In [5]:
# the slice before the comma selects rows, the slice after it selects columns
a[:, :]

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

In [6]:
# the line below grabs the first two rows, and all the columns
a[:2, :]

array([[0, 1, 2, 3],
       [4, 5, 6, 7]])

In [7]:
# the line below grabs the first two rows and the last two columns
a[:2, 2:]

array([[2, 3],
       [6, 7]])

In [8]:
# the line below grabs the first two rows, and the values in columns 1, 3
a[:2, [1, 3]]

array([[1, 3],
       [5, 7]])

In [9]:
# you can also aggregate these values like we did previously
# this is the sum of the values in the first two rows and the first two columns
a[:2, :2].sum()

10

### Axes

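Numpy can also perform operations along rows or columns, which are identified by axes.
 - Axis 0 runs down the rows, so aggregating over it returns one value per column
 - Axis 1 runs across the columns, so aggregating over it returns one value per row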

In [73]:
# returns the mean of each column (averaging down the rows)
a.mean(0)

array([6., 7., 8., 9.])

In [74]:
# returns the mean of each row (averaging across the columns)
a.mean(1)

array([ 1.5,  5.5,  9.5, 13.5])
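Because the array 'a' is square, it can be hard to tell which axis was collapsed. A small sketch with a non-square array (the name `b` is made up for this example) makes the result shapes easier to read.

```python
import numpy as np

# a 2 x 3 array: 2 rows, 3 columns
b = np.arange(6).reshape(2, 3)

# axis=0 collapses the rows, leaving one value per column
col_means = b.mean(axis=0)

# axis=1 collapses the columns, leaving one value per row
row_means = b.mean(axis=1)

print(b.shape)          # (2, 3)
print(col_means.shape)  # (3,)
print(row_means.shape)  # (2,)
```

A useful rule of thumb: the axis you pass in is the one that disappears from the shape of the result.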

Most of the time you'll be working with other libraries built on top of Numpy, and those libraries make Numpy function calls under the hood, so it's very beneficial to understand how Numpy works.

Likewise, you can use Numpy commands interchangeably with many other libraries, as we'll see below.

### Section II:  Introduction to Pandas

 - Pandas is the library most frequently used to access data from external sources
 - Easily connects to external data:
  - csv
  - excel
  - sql
  - json
  - hdfs
  - etc
 - Is built on top of Numpy, and is hence compatible with most numpy commands, but has a variety of customized methods added on top of it.
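As a rough illustration of that compatibility, the sketch below applies a Numpy function directly to a small, made-up dataframe (`toy` is not part of the class data) and then pulls out the underlying array.

```python
import numpy as np
import pandas as pd

# a small, made-up dataframe (not the Titanic data)
toy = pd.DataFrame({'height': [1.0, 4.0, 9.0], 'width': [16.0, 25.0, 36.0]})

# numpy functions accept pandas objects and keep the row and column labels
roots = np.sqrt(toy)

# .to_numpy() exposes the values as a plain numpy array
values = toy.to_numpy()

print(type(roots).__name__)
print(values.shape)
```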

In [10]:
# the following command reads in a csv file to a pandas dataframe
import pandas as pd
df = pd.read_csv(r'C:\Users\Jonat\OneDrive\General Assembly\Data-Science\Machine Learning Bootcamp\Data\titanic.csv')
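If you don't have the file at that path, the following sketch shows the same call on a few lines of made-up csv text held in memory. The column names echo the Titanic file, but the rows are invented for illustration.

```python
import io
import pandas as pd

# made-up csv text standing in for a file on disk
csv_text = """PassengerId,Name,Age
1,Passenger A,22
2,Passenger B,
3,Passenger C,26
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)
print(sample.dtypes)
```

Note that the empty Age field is read in as a missing value (NaN), which is why the column comes back as floats rather than integers.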

### Selecting Information With Pandas

In [11]:
# the following grabs the first 5 rows of the dataframe
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [12]:
# this grabs the last 5
df.tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


In [13]:
# if you pass in an integer as an argument, it will display that many rows in head() or tail()
df.head(10) # returns the first 10 rows of the dataframe

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


In [14]:
# columns can be accessed by their labels in a manner similar to looking up a key in a dictionary
df['Name']

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
5                                       Moran, Mr. James
6                                McCarthy, Mr. Timothy J
7                         Palsson, Master. Gosta Leonard
8      Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)
9                    Nasser, Mrs. Nicholas (Adele Achem)
10                       Sandstrom, Miss. Marguerite Rut
11                              Bonnell, Miss. Elizabeth
12                        Saundercock, Mr. William Henry
13                           Andersson, Mr. Anders Johan
14                  Vestrom, Miss. Hulda Amanda Adolfina
15                      Hewlett, Mrs. (Mary D Kingcome) 
16                                  Rice, Master. Eugene
17                          Wil

In [15]:
# you can also use slice notation to manage how many rows show up for your selection
# this returns the first 20 rows of the selected column as a pandas Series
df['Name'][:20]

0                               Braund, Mr. Owen Harris
1     Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                Heikkinen, Miss. Laina
3          Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                              Allen, Mr. William Henry
5                                      Moran, Mr. James
6                               McCarthy, Mr. Timothy J
7                        Palsson, Master. Gosta Leonard
8     Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)
9                   Nasser, Mrs. Nicholas (Adele Achem)
10                      Sandstrom, Miss. Marguerite Rut
11                             Bonnell, Miss. Elizabeth
12                       Saundercock, Mr. William Henry
13                          Andersson, Mr. Anders Johan
14                 Vestrom, Miss. Hulda Amanda Adolfina
15                     Hewlett, Mrs. (Mary D Kingcome) 
16                                 Rice, Master. Eugene
17                         Williams, Mr. Charles

Any variant of Python's slice notation can be used to access information inside of a dataframe.
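A few of those variants are sketched below on a made-up Series (`s` stands in for a dataframe column with the default 0, 1, 2, ... index).

```python
import pandas as pd

# a made-up series of ten values, standing in for a dataframe column
s = pd.Series(range(10, 20))

last_three = s[-3:]    # the final three rows
every_other = s[::2]   # every second row
middle = s[2:5]        # rows at positions 2, 3 and 4

print(last_three.tolist())
print(every_other.tolist())
print(middle.tolist())
```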

In [16]:
# to access two or more columns by their labels, you have to pass them in as a list
df[['Name', 'Age']].head()

Unnamed: 0,Name,Age
0,"Braund, Mr. Owen Harris",22.0
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0
2,"Heikkinen, Miss. Laina",26.0
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0
4,"Allen, Mr. William Henry",35.0


In [17]:
# you can also aggregate any slice of a dataframe
# this returns the average value of the first 20 rows of the 'Age' and 'Fare' columns
df[['Age', 'Fare']][:20].mean()

Age     28.00000
Fare    22.19937
dtype: float64

In [18]:
# rows and columns can also be accessed via index positions and slices using the .iloc command
# this returns the first 50 rows of the first two columns
df.iloc[:50, :2]

Unnamed: 0,PassengerId,Survived
0,1,0
1,2,1
2,3,1
3,4,1
4,5,0
5,6,0
6,7,0
7,8,0
8,9,1
9,10,1


In [19]:
# if you want to grab rows or columns that can't be accessed via slices, you can pass them in as a list
# this grabs the first 20 rows of columns 1, 4, and 5
df.iloc[:20, [1, 4, 5]]

Unnamed: 0,Survived,Sex,Age
0,0,male,22.0
1,1,female,38.0
2,1,female,26.0
3,1,female,35.0
4,0,male,35.0
5,0,male,
6,0,male,54.0
7,0,male,2.0
8,1,female,27.0
9,1,female,14.0


In [20]:
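# and in a similar manner, the resulting dataframe from a slice can be aggregated
# numeric_only=True skips the text column 'Sex', which recent pandas versions raise an error on otherwise
df.iloc[:20, [1, 4, 5]].mean(numeric_only=True)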

Survived     0.5
Age         28.0
dtype: float64

### Selecting DataFrames Based on Conditions

In addition to using labels, indices and slices to select information within dataframes, you can also write boolean conditions that test each row of a dataframe and keep only the rows that satisfy them.

This mimics SELECT/FROM/WHERE statements you'd typically use in SQL.
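To see the parallel side by side, here is a minimal sketch that runs the same request once with pandas and once as SQL against an in-memory SQLite copy of a made-up table. The `people` table and its contents are assumptions for illustration, not the Titanic data.

```python
import sqlite3
import pandas as pd

# a small, made-up table (not the Titanic data)
people = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Age': [22, 38, 26, 35],
})

# pandas: boolean condition inside the brackets, then pick a column
pandas_result = people[people.Age > 30]['Name'].tolist()

# SQL: the same request against an in-memory SQLite copy of the table
conn = sqlite3.connect(':memory:')
people.to_sql('people', conn, index=False)
sql_result = pd.read_sql_query(
    'SELECT Name FROM people WHERE Age > 30', conn
)['Name'].tolist()
conn.close()

print(pandas_result, sql_result)
```

The dataframe plays the role of FROM, the boolean condition plays the role of WHERE, and the column label plays the role of SELECT.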

In [21]:
# the following command returns True or False for all of the rows in the dataframe
df.Age > 30

0      False
1       True
2      False
3       True
4       True
5      False
6       True
7      False
8      False
9      False
10     False
11      True
12     False
13      True
14     False
15      True
16     False
17     False
18      True
19     False
20      True
21      True
22     False
23     False
24     False
25      True
26     False
27     False
28     False
29     False
       ...  
861    False
862     True
863    False
864    False
865     True
866    False
867     True
868    False
869    False
870    False
871     True
872     True
873     True
874    False
875    False
876    False
877    False
878    False
879     True
880    False
881     True
882    False
883    False
884    False
885     True
886    False
887    False
888    False
889    False
890     True
Name: Age, Length: 891, dtype: bool

If we take this same statement, and pass it into the dataframe itself, it'll only return rows where the condition evaluates to True.

In [22]:
df[df.Age > 30]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.5500,C103,S
13,14,0,3,"Andersson, Mr. Anders Johan",male,39.0,1,5,347082,31.2750,,S
15,16,1,2,"Hewlett, Mrs. (Mary D Kingcome)",female,55.0,0,0,248706,16.0000,,S
18,19,0,3,"Vander Planke, Mrs. Julius (Emelia Maria Vande...",female,31.0,1,0,345763,18.0000,,S
20,21,0,2,"Fynney, Mr. Joseph J",male,35.0,0,0,239865,26.0000,,S
21,22,1,2,"Beesley, Mr. Lawrence",male,34.0,0,0,248698,13.0000,D56,S


In [23]:
# we can build on the selecting methods we used before and combine them with boolean selectors
# this returns the 'Survived' column where the Age of the passenger was > 30
df[df.Age > 30]['Survived']

1      1
3      1
4      0
6      0
11     1
13     0
15     1
18     0
20     0
21     1
25     1
30     0
33     0
35     0
40     0
52     1
54     0
61     1
62     0
70     0
74     1
85     1
92     0
94     0
96     0
98     1
99     0
103    0
104    0
108    0
      ..
808    0
809    1
811    0
812    0
814    0
817    0
818    0
820    1
822    0
829    1
835    1
838    1
843    0
845    0
847    0
851    0
854    0
856    1
857    1
860    0
862    1
865    1
867    0
871    1
872    0
873    0
879    1
881    0
885    0
890    0
Name: Survived, Length: 305, dtype: int64

In [24]:
# the first 30 rows of said column
df[df.Age > 30]['Survived'][:30]

1      1
3      1
4      0
6      0
11     1
13     0
15     1
18     0
20     0
21     1
25     1
30     0
33     0
35     0
40     0
52     1
54     0
61     1
62     0
70     0
74     1
85     1
92     0
94     0
96     0
98     1
99     0
103    0
104    0
108    0
Name: Survived, dtype: int64

In [25]:
# the average value of the above selection
df[df.Age > 30]['Survived'][:30].mean()

0.36666666666666664

In [26]:
# you can also use the .iloc command with the above notation
df[df.Age > 30].iloc[:50, [1, 2]][:25]

Unnamed: 0,Survived,Pclass
1,1,1
3,1,1
4,0,3
6,0,1
11,1,1
13,0,3
15,1,2
18,0,3
20,0,2
21,1,2


#### Combining Multiple Conditions For Boolean Selection

 - You can select rows based on multiple criteria
 - Follows similar rules for returning True/False values, but uses slightly different syntax
 - 'and' operator is replaced with symbol '&', 'or' is replaced with '|'
 - Each condition has to be wrapped in parentheses
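The reason for these rules is that Python's `and` expects a single True/False value, while `&` works row by row. `&` also binds more tightly than comparisons such as `>`, which is why each condition needs its own parentheses. A small sketch on a made-up table (`people` is hypothetical):

```python
import pandas as pd

# a small, made-up table (not the Titanic data)
people = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male'],
    'Age': [22, 38, 26, 35],
})

# Python's 'and' asks for a single True/False, which a whole column can't give
try:
    (people.Age > 30) and (people.Sex == 'female')
    and_failed = False
except ValueError:
    and_failed = True

# '&' compares the two columns row by row instead
mask = (people.Age > 30) & (people.Sex == 'female')

print(and_failed)
print(mask.tolist())
```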

In [27]:
# this selects for rows where Age is greater than 30, and sex is equal to 'female'
(df.Age > 30) & (df.Sex == 'female')

0      False
1       True
2      False
3       True
4      False
5      False
6      False
7      False
8      False
9      False
10     False
11      True
12     False
13     False
14     False
15      True
16     False
17     False
18      True
19     False
20     False
21     False
22     False
23     False
24     False
25      True
26     False
27     False
28     False
29     False
       ...  
861    False
862     True
863    False
864    False
865     True
866    False
867    False
868    False
869    False
870    False
871     True
872    False
873    False
874    False
875    False
876    False
877    False
878    False
879     True
880    False
881    False
882    False
883    False
884    False
885     True
886    False
887    False
888    False
889    False
890    False
Length: 891, dtype: bool

In [28]:
# this does the same, but with the 'or' condition: Age is greater than 30, or sex is equal to 'female'
(df.Age > 30) | (df.Sex == 'female')

0      False
1       True
2       True
3       True
4       True
5      False
6       True
7      False
8       True
9       True
10      True
11      True
12     False
13      True
14      True
15      True
16     False
17     False
18      True
19      True
20      True
21      True
22      True
23     False
24      True
25      True
26     False
27     False
28      True
29     False
       ...  
861    False
862     True
863     True
864    False
865     True
866     True
867     True
868    False
869    False
870    False
871     True
872     True
873     True
874     True
875     True
876    False
877    False
878    False
879     True
880     True
881     True
882     True
883    False
884    False
885     True
886    False
887     True
888     True
889    False
890     True
Length: 891, dtype: bool

In [29]:
# if we pass the above statement into a dataframe, we'll get the rows where this evaluates to True
df[(df.Age > 30) | (df.Sex == 'female')]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7000,G6,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.5500,C103,S
13,14,0,3,"Andersson, Mr. Anders Johan",male,39.0,1,5,347082,31.2750,,S


In [30]:
# and of course we can aggregate, slice, and select like we did previously
df[(df.Age > 30) | (df.Sex == 'female')]['Survived'][:100].mean()

0.47

### Creating New Data in a Pandas Dataframe

In [31]:
# you can create new columns in a manner that's analogous to creating a key in a dictionary
df['Age_Double'] = df.Age * 2
df['Age_Double'].head()

0    44.0
1    76.0
2    52.0
3    70.0
4    70.0
Name: Age_Double, dtype: float64

But what if you want to create information based on what already exists within your dataframe?

That is, create new columns whose values depend on other columns within the dataframe.

We'll go over two scenarios:
 - a binary choice
 - one that has multiple possible outcomes

For the binary choice we'd use the command np.where()

In [32]:
# this creates a column that labels each passenger as 'Adult' or 'Adolescent'
# depending on whether or not their age is 25 or above (missing ages fall into 'Adolescent')
df['Adult_Or_Not'] = np.where(df.Age >= 25, 'Adult', 'Adolescent')
df[['Age','Adult_Or_Not']].head()

Unnamed: 0,Age,Adult_Or_Not
0,22.0,Adolescent
1,38.0,Adult
2,26.0,Adult
3,35.0,Adult
4,35.0,Adult
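One thing to watch for with np.where() is missing data: a comparison against NaN is False, so rows with no age end up in the second label. The sketch below shows this on made-up ages, along with one possible way to label them separately; the 'Unknown' label is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# made-up ages, including one missing value
ages = pd.Series([22.0, np.nan, 38.0])

# NaN >= 25 is False, so the missing age lands in the 'else' branch
labels = np.where(ages >= 25, 'Adult', 'Adolescent')

# one option: label missing ages explicitly after applying the rule
labels_checked = np.where(ages.isna(), 'Unknown', labels)

print(labels.tolist())
print(labels_checked.tolist())
```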


For more complicated conditions, we'll want to use the command np.select()

For example, let's create a column that groups passengers by sex and by whether or not they're an adult.

In [33]:
# create a list that contains the different comparisons you want to make
conditions = [
    (df.Sex == 'female') & (df.Adult_Or_Not == 'Adult'),
    (df.Sex == 'female') & (df.Adult_Or_Not == 'Adolescent'),
    (df.Sex == 'male') & (df.Adult_Or_Not == 'Adult'),
    (df.Sex == 'male') & (df.Adult_Or_Not == 'Adolescent')
]

# these are the corresponding results that we want for each condition
results = ['Female Adult', 'Female Adolescent', 'Male Adult', 'Male Adolescent']

# we can now pass these into the np.select command
# the last argument is the value that'll be used if every condition is false
df['Status'] = np.select(conditions, results, 'Other')
df['Status'].head(10)

0      Male Adolescent
1         Female Adult
2         Female Adult
3         Female Adult
4           Male Adult
5      Male Adolescent
6           Male Adult
7      Male Adolescent
8         Female Adult
9    Female Adolescent
Name: Status, dtype: object
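The conditions above never overlap, but when they do, np.select() uses the first one that is True for each row. Here is a minimal sketch of that behavior on made-up fares; the band names and cut-offs are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# made-up fares (not the Titanic data), including one missing value
fares = pd.Series([5.0, 20.0, 80.0, np.nan])

# conditions are checked in order, and the first True one wins
conditions = [
    fares >= 50,
    fares >= 10,   # only decides the result when the fare is below 50
    fares >= 0,
]
results = ['High', 'Medium', 'Low']

# the missing fare matches nothing, so it gets the default
bands = np.select(conditions, results, 'Unknown')

print(bands.tolist())
```

Ordering the conditions from most to least specific keeps each one short, since later conditions only apply to rows the earlier ones didn't claim.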