# Kaggle-python-tutorial-on-machine-learning
# Get the Data with Pandas
When the Titanic sank, 1502 of the 2224 passengers and crew were killed. One of the main reasons for this high level of casualties was the lack of lifeboats on this self-proclaimed "unsinkable" ship.

Those who have seen the movie know that some individuals were more likely to survive the sinking (lucky Rose) than others (poor Jack). In this course, you will learn how to apply machine learning techniques in Python to predict a passenger's chance of surviving.

Let's start by loading the training and test sets into your Python environment. You will use the training set to build your model and the test set to validate it. The data is stored on the web as CSV files; their URLs are already available as character strings in the sample code. You can load this data with the read_csv() function from the Pandas library.

1. First, import the Pandas library as pd.
2. Load the test data similarly to how the train data is loaded.
3. Inspect the first few rows of the loaded DataFrames using the .head() method with the code provided.

In [1]:
# Import the Pandas library
import pandas as pd
# Load the train and test datasets to create two DataFrames
train_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv"
train = pd.read_csv(train_url)

test_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/test.csv"
test = pd.read_csv(test_url)
# Print the `head` of the train and test DataFrames
print(train.head())
print(test.head())

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  
  

# Understanding your data
Before starting the actual analysis, it's important to understand the structure of your data. Both test and train are DataFrame objects, which is how pandas represents datasets. You can easily explore a DataFrame using the .describe() method. .describe() summarizes the numeric columns (features) of the DataFrame, including the count of observations, mean, max and so on. Another useful trick is to look at the dimensions of the DataFrame by requesting the .shape attribute of your DataFrame object (e.g. your_data.shape).

The training and test sets are already available in the workspace, as train and test. Apply the .describe() method and print the .shape attribute of the training set.

In [2]:
train.describe()

       PassengerId  Survived    Pclass        Age     SibSp     Parch       Fare
count        891.0     891.0     891.0      714.0     891.0     891.0      891.0
mean         446.0  0.383838  2.308642  29.699118  0.523008  0.381594  32.204208
std     257.353842  0.486592  0.836071  14.526497  1.102743  0.806057  49.693429
min            1.0       0.0       1.0       0.42       0.0       0.0        0.0
25%          223.5       0.0       2.0     20.125       0.0       0.0     7.9104
50%          446.0       0.0       3.0       28.0       0.0       0.0    14.4542
75%          668.5       1.0       3.0       38.0       1.0       0.0       31.0
max          891.0       1.0       3.0       80.0       8.0       6.0   512.3292


In [3]:
train.shape

(891, 12)

In [4]:
test.describe()

       PassengerId    Pclass        Age     SibSp     Parch       Fare
count        418.0     418.0      332.0     418.0     418.0      417.0
mean        1100.5   2.26555   30.27259  0.447368  0.392344  35.627188
std     120.810458  0.841838  14.181209   0.89676  0.981429  55.907576
min          892.0       1.0       0.17       0.0       0.0        0.0
25%         996.25       1.0       21.0       0.0       0.0     7.8958
50%         1100.5       3.0       27.0       0.0       0.0    14.4542
75%        1204.75       3.0       39.0       1.0       0.0       31.5
max         1309.0       3.0       76.0       8.0       9.0   512.3292


In [5]:
test.shape

(418, 11)

Which of the following statements is correct?
Possible Answers
 1. The training set has 891 observations and 12 variables, count for Age is 714.
 2. The training set has 418 observations and 11 variables, count for Age is 891.
 3. The testing set has 891 observations and 11 variables, count for Age is 891.
 4. The testing set has 418 observations and 12 variables, count for Age is 714.
 
 Answer is: 1 
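
The count reported by .describe() only includes non-missing values, which is why the count for Age in the training set (714) is lower than the 891 rows reported by .shape. The sketch below illustrates this on a small made-up DataFrame (the `toy` frame and its values are hypothetical, not taken from the Titanic files):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: four rows, one of which has a missing Age
toy = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1],
})

n_rows, n_cols = toy.shape                       # rows and columns, missing values included
age_count = toy.describe().loc["count", "Age"]   # non-missing Age values only
missing_per_column = toy.isna().sum()            # number of missing values per column

print(n_rows, n_cols)
print(age_count)
print(missing_per_column)
```

Running `train.isna().sum()` on the real training set should show in the same way which columns have gaps.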

# Rose vs Jack, or Female vs Male
How many people in your training set survived the Titanic disaster? To find out, you can use the value_counts() method in combination with standard bracket notation, which selects a single column of a DataFrame:

# absolute numbers
train["Survived"].value_counts()

# percentages
train["Survived"].value_counts(normalize = True)
If you run these commands in the console, you'll see that 549 individuals died (62%) and 342 survived (38%). A simple heuristic prediction would be "majority wins": predict that every unseen passenger did not survive.
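
To make the "majority wins" idea concrete, here is a minimal sketch that rebuilds a Survived column from the two counts quoted above and measures how often a constant guess would be right. The variable names are illustrative, and the figure is the hit rate on the training counts themselves, not a performance estimate for unseen passengers:

```python
import pandas as pd

# Rebuild the Survived column from the counts quoted above (549 died, 342 survived)
survived = pd.Series([0] * 549 + [1] * 342, name="Survived")

proportions = survived.value_counts(normalize=True)
majority_class = proportions.idxmax()   # the most frequent label
baseline_accuracy = proportions.max()   # share of rows a constant guess gets right

print(majority_class)
print(round(baseline_accuracy, 3))
```

Any model you build later should do better than this baseline to be worth using.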

To dive a little deeper, you can perform similar counts and percentage calculations on subsets of the Survived column. For example, could gender play a role as well? You can explore this using the .value_counts() method for a two-way comparison of the number of males and females that survived, with this syntax:

train["Survived"][train["Sex"] == 'male'].value_counts()
train["Survived"][train["Sex"] == 'female'].value_counts()
To get proportions, you can again pass in the argument normalize = True to the .value_counts() method.

In [6]:

# Passengers that survived vs passengers that passed away
print("Survived passengers vs passengers passed away:\n" , train["Survived"].value_counts())

# As proportions
print("Survived passengers vs passengers passed away:\n" , train["Survived"].value_counts(normalize=True).round(2))

# Males that survived vs males that passed away
print("Males survived: \n", train["Survived"][train["Sex"] == 'male'].value_counts())

# Females that survived vs Females that passed away
print("Females survived: \n",train["Survived"][train["Sex"] == 'female'].value_counts())

# Normalized male survival
print("Males survived: \n", train["Survived"][train["Sex"] == 'male'].value_counts(normalize=True))

# Normalized female survival
print("Females survived: \n",train["Survived"][train["Sex"] == 'female'].value_counts(normalize=True))


Survived passengers vs passengers passed away:
 0    549
1    342
Name: Survived, dtype: int64
Survived passengers vs passengers passed away:
 0    0.62
1    0.38
Name: Survived, dtype: float64
Males survived: 
 0    468
1    109
Name: Survived, dtype: int64
Females survived: 
 1    233
0     81
Name: Survived, dtype: int64
Males survived: 
 0    0.811092
1    0.188908
Name: Survived, dtype: float64
Females survived: 
 1    0.742038
0    0.257962
Name: Survived, dtype: float64
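
The four printouts above can also be laid out as a single table. The sketch below uses pd.crosstab, which is not what the exercise asks for, and rebuilds the Sex/Survived pairs from the counts printed above rather than downloading the data (the `toy` frame is a stand-in for train):

```python
import pandas as pd

# Rebuild (Sex, Survived) pairs from the counts printed above
counts = {("male", 0): 468, ("male", 1): 109, ("female", 0): 81, ("female", 1): 233}
rows = [(sex, survived) for (sex, survived), n in counts.items() for _ in range(n)]
toy = pd.DataFrame(rows, columns=["Sex", "Survived"])

# normalize="index" makes each row (each sex) sum to 1
rates = pd.crosstab(toy["Sex"], toy["Survived"], normalize="index")

print(rates.round(3))
```

With the real data, `pd.crosstab(train["Sex"], train["Survived"], normalize="index")` should give the same proportions in one call.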


# Does age play a role?
Another variable that could influence survival is age, since it's probable that children were saved first. You can test this by creating a new column with a categorical variable Child. Child will take the value 1 where age is less than 18, and the value 0 where age is greater than or equal to 18.

To add this new variable you need to do two things: (i) create a new column, and (ii) provide the values for each observation (i.e., row) based on the age of the passenger.

Adding a new column with Pandas in Python is easy and can be done via the following syntax:

your_data["new_var"] = 0
This code would create a new column in the your_data DataFrame titled new_var with 0 for each observation.

To set the values based on the age of the passenger, you make use of a boolean test inside the square bracket operator. With the []-operator you create a subset of rows and assign a value to a certain variable of that subset of observations. For example,

train["new_var"][train["Fare"] > 10] = 1
would give a value of 1 to the variable new_var for the subset of passengers whose fare is greater than 10. Remember that new_var keeps a value of 0 for all other passengers (including those with a missing fare).
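
To see what the boolean test does, the sketch below runs the same idea on a small made-up DataFrame (the `toy` frame and the missing fare are hypothetical). It writes the assignment with .loc instead of two chained brackets; this is a swapped-in form, chosen because pandas warns about chained assignment and, depending on the version, may not update the original DataFrame at all:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three fares and one missing value
toy = pd.DataFrame({"Fare": [7.25, 71.2833, np.nan, 53.1]})

# (i) create the new column with a default of 0
toy["new_var"] = 0

# (ii) boolean test: True where Fare > 10, False otherwise (a missing fare compares as False)
mask = toy["Fare"] > 10

# Select rows with the mask and the column by name in a single .loc call
toy.loc[mask, "new_var"] = 1

print(mask.tolist())
print(toy)
```

The row with the missing fare keeps its default of 0, matching the behaviour described above.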

A new column called Child in the train data frame has been created for you that takes the value NaN for all observations.

INSTRUCTIONS
1. Set the values of Child to 1 if the passenger's age is less than 18 years.
2. Then assign the value 0 to observations where the passenger's age is greater than or equal to 18 years in the new Child column.
3. Compare the normalized survival rates for those who are under 18 and those who are older. Use code similar to what you had in the previous exercise.

In [8]:
# Create the column Child and assign to 'NaN'
train["Child"] = float('NaN')

# Assign 1 to passengers under 18, 0 to those 18 or older. Print the new column.
train["Child"][train["Age"] < 18] = 1
train["Child"][train["Age"] >= 18] = 0
print(train["Child"])

# Print normalized Survival Rates for passengers under 18
print(train["Survived"][train["Child"] == 1].value_counts(normalize = True))

# Print normalized Survival Rates for passengers 18 or older
print(train["Survived"][train["Child"] == 0].value_counts(normalize = True))

0      0.0
1      0.0
2      0.0
3      0.0
4      0.0
5      NaN
6      0.0
7      1.0
8      0.0
9      1.0
10     1.0
11     0.0
12     0.0
13     0.0
14     1.0
15     0.0
16     1.0
17     NaN
18     0.0
19     NaN
20     0.0
21     0.0
22     1.0
23     0.0
24     1.0
25     0.0
26     NaN
27     0.0
28     NaN
29     NaN
      ... 
861    0.0
862    0.0
863    NaN
864    0.0
865    0.0
866    0.0
867    0.0
868    NaN
869    1.0
870    0.0
871    0.0
872    0.0
873    0.0
874    0.0
875    1.0
876    0.0
877    0.0
878    NaN
879    0.0
880    0.0
881    0.0
882    0.0
883    0.0
884    0.0
885    0.0
886    0.0
887    0.0
888    NaN
889    0.0
890    0.0
Name: Child, Length: 891, dtype: float64
1    0.539823
0    0.460177
Name: Survived, dtype: float64
0    0.618968
1    0.381032
Name: Survived, dtype: float64
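
Chained assignment of the form train["Child"][...] = 1 is what makes pandas report "A value is trying to be set on a copy of a slice from a DataFrame". As a hedged alternative, the sketch below builds Child from a single expression, with no assignment into a slice, and then compares survival rates with groupby. The `toy` frame is made up for illustration, and this is not the approach the exercise prescribes:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one passenger has no recorded age
toy = pd.DataFrame({
    "Age": [22.0, 4.0, np.nan, 35.0, 16.0, 40.0],
    "Survived": [0, 1, 1, 0, 1, 1],
})

age = toy["Age"]
# 1.0 if under 18, 0.0 otherwise; .where() puts NaN back where the age is missing
child = (age < 18).astype(float).where(age.notna())
toy["Child"] = child

# Mean of a 0/1 column is the survival rate; groupby skips rows where Child is NaN
rates = toy.groupby("Child")["Survived"].mean()

print(toy)
print(rates)
```

Passengers without an age stay NaN in Child, as in the output above, so they are left out of both groups.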


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
