# **Data Compilation**
This tutorial will teach you how to compile a subset of data from a larger data source. Data sets often contain many categories, and not all of them are relevant to the question you are asking. Creating smaller subsets that focus on specific categories makes the data easier to analyze and helps us reach more specific conclusions.

I will be using a data set of Titanic Passengers. The data set was created to be used in machine learning.

## **Step 1)** Mounting Google Drive
After you have saved your data set as a **CSV** file to your **Google Drive**, the first thing you need to do is give Colab access to your Google Drive.

To do this, you'll type the same command that I have in the code cell below.

*Note that anytime you need a new cell to type code in, you can click the button in the top left corner of your screen that says "+ Code"*

In [1]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


## **Step 2)** Importing Packages
Next, you'll need to import two packages called **pandas** and **NumPy**. The pandas package lets us load our data into a table (a dataframe), display it, and filter it.

*Note that "np" and "pd" are optional names assigned to these packages. Naming them is not a requirement, and you can even give them different names. However, it is recommended simply because being able to type "pd" instead of "pandas" will be easier later.*

**To do this, you'll type the same command that I have in the code cell below.**

*Note that for any command you type into a code box, you'll have to click the run cell button, which you'll see when you hover over the left side of the code cell.*

In [2]:
import numpy as np
import pandas as pd

## **Step 3)** Defining the Dataframe
The next thing you'll need to do is give your dataframe a name and tell Colab where to find the file in your Google Drive.

First, choose a name for your dataframe. You should choose something simple and easy to type because you'll have to use this name a lot. For example, I'm going to name mine **data**.

Next, you'll have to tell Colab where to find your CSV file and what to do with it. To do this, we're going to use the **pd.read_csv()** function.

In the parentheses of this function, you'll give Colab the path to the CSV file. For example, my CSV file is stored in gdrive -> MyDrive -> Colab Notebooks and the file is called train.csv. The path starts with gdrive because that is where we mounted Google Drive in Step 1, inside Colab's default working folder, /content.

*Note that spaces, capitalization, and exact spelling are important. You will also need to put the path inside quotation marks (single or double).*

*dataframe name* **= pd.read_csv(** *file path* **)**

In [3]:
data = pd.read_csv('gdrive/MyDrive/Colab Notebooks/train.csv')
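
If you want to try **pd.read_csv()** before your Drive file is in place, here is a minimal sketch that reads a tiny CSV typed directly into the cell. The three rows and the name `sample` are made up for illustration; they are not taken from train.csv.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a real file (not the actual Titanic data).
# Names that contain a comma must be wrapped in double quotes.
csv_text = '''PassengerId,Survived,Pclass,Name
1,0,3,"Doe, Mr. John"
2,1,1,"Roe, Mrs. Jane"
3,1,2,"Poe, Miss. Ann"
'''

# read_csv accepts a file path or a file-like object such as StringIO.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample)
```

With a real file you would pass the quoted path instead of `io.StringIO(csv_text)`, exactly as in the cell above.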

## **Step 4)** Know your Data Attributes
To access the names of your categories, you can use the **.columns** attribute.

To access the number of rows and columns, in that order, you can use the **.shape** attribute.

To access the total number of data values in your data, you can use the **.size** attribute.

*Note that these are attributes, not functions, so you type them without parentheses.*

To visualize it all together, we'll also print the dataframe in a table.

In [None]:
data.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [None]:
data.shape

(891, 12)

In [None]:
data.size

10692
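
The three attributes are related: **.size** is simply rows times columns. For the Titanic data above, 891 × 12 = 10692. Here is a small sketch on a made-up table (the name `toy` and its values are only for illustration) that checks the same relationship.

```python
import pandas as pd

# Made-up table with 3 rows and 2 columns.
toy = pd.DataFrame({"Pclass": [1, 2, 3], "Survived": [1, 0, 0]})

rows, cols = toy.shape        # shape is a (rows, columns) pair
print(list(toy.columns))      # ['Pclass', 'Survived']
print(rows, cols)             # 3 2
print(toy.size)               # 6, which is rows * cols
```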

In [None]:
data

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


## **Step 5)** Creating Subsets
Looking at the data attributes above, you can see there is a lot of data in this file.

To learn a little more about this data, we're going to make a subset. This means we're going to define a smaller dataframe that includes only the data that fits certain conditions.

With my data, I will define a subset of each class (1st, 2nd, and 3rd). To do this, for example, I'll use the following condition to make a subset of first class passengers.

**data[ "Pclass" ] == 1**.

This condition checks every row: if the value in the "Pclass" column equals the value I'm looking for, the row is included in the subset.

*Note that " == " is used in boolean (true/false) statements while " = " is used when you are defining something.*
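
To see what the condition produces on its own, here is a minimal sketch on a made-up table (the names `toy`, `mask`, and `first_class` are only for illustration). The comparison gives one True/False value per row, and putting that result inside the square brackets keeps only the True rows.

```python
import pandas as pd

# Made-up passengers, not the real Titanic data.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Pclass": [1, 3, 1, 2],
})

mask = toy["Pclass"] == 1     # one True/False value per row
print(mask.tolist())          # [True, False, True, False]

first_class = toy[mask]       # keeps only the rows where mask is True
print(first_class)
```

The subset keeps the original row labels (0 and 2 here), which is why the row numbers in a filtered table are not consecutive.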

Additionally, if some columns hold information that I find more useful than others, I can filter by column too. I'm going to show only the data for the "Name" and "Survived" categories using the following condition.

*Subset Name* **[ [ "Name", "Survived" ] ]**
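
The double brackets matter: the inner pair is a Python list of column names, and the outer pair does the selecting. A small sketch on a made-up table (all names and values here are illustrative) shows the difference between asking for one column and asking for a list of columns.

```python
import pandas as pd

# Made-up table for illustration.
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Survived": [1, 0, 1],
    "Pclass": [1, 3, 2],
})

one_column = toy["Name"]                  # single name -> a Series (one column)
two_columns = toy[["Name", "Survived"]]   # list of names -> a smaller DataFrame

print(type(one_column).__name__)          # Series
print(type(two_columns).__name__)         # DataFrame
print(list(two_columns.columns))          # ['Name', 'Survived']
```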

In [5]:
class1 = data[data["Pclass"] == 1]
class2 = data[data["Pclass"] == 2]
class3 = data[data["Pclass"] == 3]
class1[["Name", "Survived"]]

Unnamed: 0,Name,Survived
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1
6,"McCarthy, Mr. Timothy J",0
11,"Bonnell, Miss. Elizabeth",1
23,"Sloper, Mr. William Thompson",1
...,...,...
871,"Beckwith, Mrs. Richard Leonard (Sallie Monypeny)",1
872,"Carlsson, Mr. Frans Olof",0
879,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",1
887,"Graham, Miss. Margaret Edith",1


In [10]:
class2[["Name", "Survived"]]

Unnamed: 0,Name,Survived
9,"Nasser, Mrs. Nicholas (Adele Achem)",1
15,"Hewlett, Mrs. (Mary D Kingcome)",1
17,"Williams, Mr. Charles Eugene",1
20,"Fynney, Mr. Joseph J",0
21,"Beesley, Mr. Lawrence",1
...,...,...
866,"Duran y More, Miss. Asuncion",1
874,"Abelson, Mrs. Samuel (Hannah Wizosky)",1
880,"Shelley, Mrs. William (Imanita Parrish Hall)",1
883,"Banfield, Mr. Frederick James",0


In [11]:
class3[["Name", "Survived"]]

Unnamed: 0,Name,Survived
0,"Braund, Mr. Owen Harris",0
2,"Heikkinen, Miss. Laina",1
4,"Allen, Mr. William Henry",0
5,"Moran, Mr. James",0
7,"Palsson, Master. Gosta Leonard",0
...,...,...
882,"Dahlberg, Miss. Gerda Ulrika",0
884,"Sutehall, Mr. Henry Jr",0
885,"Rice, Mrs. William (Margaret Norton)",0
888,"Johnston, Miss. Catherine Helen ""Carrie""",0


## **Step 6)** Analysis
Now that we have our information narrowed down, we can make some observations about our data.

### **Next I'll show you some useful functions and strategies that you can use to analyze your data.**

- **.shape**: We can use the **.shape** attribute to access the number of rows and columns in our table. It gives us two values in the form (*# rows, # columns*).
  - In my case, the number of rows tells us how many passengers are in the dataset.
- **Indexing**: When using indexes in Python, it's important to note that indexing always starts at **0**. That means to access the first item in a list, you'll have to use 0 and not 1.
  - To access the number of rows only, we can use **[0]** after .shape because the number of rows is at index 0.
- **Mathematical Operators**: To perform mathematical operations, you can use the following operators:
  - **+** addition, **-** subtraction, **\*** multiplication, **/** division
- **str( )**: When you want to print strings and integers on the same line, you can use the **str( )** function to convert your integer to a string and the **+** operator to join the strings together.
  - For example, if you want to print "First Class: " and the number of first class passengers, you'll have to make that number a string by typing **str(class1.shape[0])**. Then within your print function, you can add **"First Class: " + str(class1.shape[0])**
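
The ideas in the list above can be tried together on a small made-up table (the names `toy`, `n_rows`, `n_cols`, and `message` are only for illustration). The last line uses an f-string, which is an alternative to **str( )** that this tutorial does not otherwise use.

```python
import pandas as pd

# Made-up table with 3 rows and 2 columns.
toy = pd.DataFrame({"Pclass": [1, 1, 3], "Survived": [1, 0, 0]})

n_rows = toy.shape[0]    # index 0 of (rows, columns)
n_cols = toy.shape[1]    # index 1 of (rows, columns)

# "Rows: " + n_rows would raise a TypeError, because Python will not
# add a string and an integer. Converting with str() first fixes that.
message = "Rows: " + str(n_rows)
print(message)                 # Rows: 3

# Alternative: an f-string converts the number for you.
print(f"Columns: {n_cols}")    # Columns: 2
```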

### Example:
**class3.shape[0]** will show how many 3rd class passengers in total are in our data.

**class3[class3["Survived"] == 1].shape[0]** will show how many 3rd class passengers from our data survived.

Therefore, the portion of third class passengers from our data that survived will be equal to **rows in class3[class3["Survived"] == 1] / rows in class3**

We can turn this into a percentage by multiplying it by 100, rounding it, and concatenating a % symbol.
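
As a worked sketch of that arithmetic, the cell below plugs in the third class counts that this notebook prints in its output cells (119 survivors out of 491 passengers). The variable names are only for illustration.

```python
# Counts taken from this notebook's printed output for third class.
survivors = 119
total = 491

portion = survivors / total        # a fraction between 0 and 1
percent = round(portion * 100)     # rounded to the nearest whole number
label = "Third Class Survivors: " + str(percent) + "%"
print(label)                       # Third Class Survivors: 24%
```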



In [11]:
#Prints the number of passengers from each class that are in our data
print("First Class: " + str(class1.shape[0]))
print("Second Class: " + str(class2.shape[0]))
print("Third Class: " + str(class3.shape[0]))

First Class: 216
Second Class: 184
Third Class: 491


In [12]:
#Prints the number of passengers from each class that survived.
print("First Class Survivors: " + str(class1[class1["Survived"] == 1].shape[0]))
print("Second Class Survivors: " + str(class2[class2["Survived"] == 1].shape[0]))
print("Third Class Survivors: " + str(class3[class3["Survived"] == 1].shape[0]))

First Class Survivors: 136
Second Class Survivors: 87
Third Class Survivors: 119


In [16]:
#Prints the percentage of survivors from each class.
print("First Class Survivors: " + str(round((class1[class1["Survived"] == 1].shape[0])/(class1.shape[0]) *100)) + "%")
print("Second Class Survivors: " + str(round((class2[class2["Survived"] == 1].shape[0])/(class2.shape[0]) *100)) + "%")
print("Third Class Survivors: " + str(round((class3[class3["Survived"] == 1].shape[0])/(class3.shape[0]) *100)) + "%")

First Class Survivors: 63%
Second Class Survivors: 47%
Third Class Survivors: 24%
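
The three print lines above repeat the same calculation, so one option is to wrap it in a small function. This is only a sketch: the function name `survival_percent` and the made-up table `toy` are my own assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

def survival_percent(subset):
    """Return the rounded percentage of rows in subset where Survived == 1."""
    survivors = subset[subset["Survived"] == 1].shape[0]
    return round(survivors / subset.shape[0] * 100)

# Made-up data: four passengers in class 1 and four in class 3.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 1, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

for pclass in [1, 3]:
    subset = toy[toy["Pclass"] == pclass]
    print("Class " + str(pclass) + ": " + str(survival_percent(subset)) + "%")
# Class 1: 75%
# Class 3: 25%
```

With the real data you could call `survival_percent(class1)`, `survival_percent(class2)`, and `survival_percent(class3)`. Note that the function would fail with a division by zero if it were given an empty subset.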
