# Kaggle-python-tutorial-on-machine-learning
# Get the Data with Pandas
When the Titanic sank, 1502 of the 2224 passengers and crew were killed. One of the main reasons for this high level of casualties was the lack of lifeboats on this self-proclaimed "unsinkable" ship.

Those who have seen the movie know that some individuals were more likely to survive the sinking (lucky Rose) than others (poor Jack). In this course, you will learn how to apply machine learning techniques in Python to predict a passenger's chance of surviving.

Let's start by loading the training and test sets into your Python environment. You will use the training set to build your model, and the test set to validate it. The data is stored as CSV files; their locations are already available as strings (train_url and test_url) in the sample code. You can load this data with the read_csv() function from the Pandas library.

1. First, import the Pandas library as pd.
2. Load the test data similarly to how the train data is loaded.
3. Inspect the first few rows of the loaded DataFrames using the .head() method with the code provided.

In [5]:
# Import the Pandas library
import pandas as pd
# Load the train and test datasets to create two DataFrames
train_url = "../input/titanicmy-first-prediction/train.csv"
train = pd.read_csv(train_url)

test_url = "../input/titanicmy-first-prediction/test.csv"
test = pd.read_csv(test_url)
# Print the `head` of the train and test dataframes
print(train.head())
print(test.head())
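
As a minimal sketch of what read_csv() and .head() do, the example below parses a tiny in-memory CSV instead of the real files. The CSV text and the name `toy` are made up for illustration; only the column names are borrowed from the Titanic data.

```python
import io

import pandas as pd

# Made-up CSV text standing in for a file on disk; the last Age is left empty.
csv_text = "PassengerId,Sex,Age\n1,male,22\n2,female,38\n3,female,\n"

# read_csv accepts a path, a URL, or any file-like object such as StringIO.
toy = pd.read_csv(io.StringIO(csv_text))

# head(n) returns the first n rows; with no argument it returns the first 5.
first_two = toy.head(2)

print(first_two)
print(toy["Age"].isna().sum())  # the empty Age field is parsed as a missing value
```

The empty field in the last row is read as NaN, which is how missing ages show up in the loaded DataFrames as well.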

# Understanding your data
Before starting the actual analysis, it's important to understand the structure of your data. Both test and train are DataFrame objects, which is how pandas represents datasets. You can easily explore a DataFrame using the .describe() method. .describe() summarizes the numeric columns (features) of the DataFrame, including the count of non-missing observations, the mean, the max and so on. Another useful trick is to look at the dimensions of the DataFrame. You do this by requesting the .shape attribute of your DataFrame object (e.g., your_data.shape).

The training and test sets are already available in the workspace as train and test. Apply the .describe() method and print the .shape attribute of the training set.

In [6]:
train.describe()

In [7]:
train.shape

In [8]:
test.describe()

In [9]:
test.shape

Which of the following statements is correct?

Possible Answers
1. The training set has 891 observations and 12 variables; the count for Age is 714.
2. The training set has 418 observations and 11 variables; the count for Age is 891.
3. The testing set has 891 observations and 11 variables; the count for Age is 891.
4. The testing set has 418 observations and 12 variables; the count for Age is 714.

Answer: 1
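
The count reported by .describe() can be smaller than the number of rows in .shape because count only includes non-missing values. Here is a small sketch on made-up data (the DataFrame `toy` and its values are assumptions, not the Titanic set):

```python
import numpy as np
import pandas as pd

# Made-up toy data: one of the four Age values is missing.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

n_rows, n_cols = toy.shape      # (number of rows, number of columns)
age_count = toy["Age"].count()  # counts non-missing Age values only
summary = toy.describe()        # count, mean, std, min, quartiles, max per column

print(n_rows, n_cols)
print(age_count)
print(summary.loc["count"])
```

The same effect explains the quiz: the number of rows comes from .shape, while a lower count for Age indicates missing ages.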

# Rose vs Jack, or Female vs Male
How many people in your training set survived the Titanic disaster? To see this, you can use the value_counts() method in combination with standard bracket notation to select a single column of a DataFrame:

# absolute numbers
train["Survived"].value_counts()

# percentages
train["Survived"].value_counts(normalize = True)
If you run these commands in the console, you'll see that 549 individuals died (62%) and 342 survived (38%). A simple heuristic prediction could be "majority wins": you predict that no unseen passenger survives.
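
The "majority wins" idea can be sketched as follows. The labels are made up (0 = died, 1 = survived), and the names `majority_class` and `baseline_accuracy` are illustrative assumptions rather than part of the exercise.

```python
import pandas as pd

# Made-up labels, not the real Survived column.
survived = pd.Series([0, 0, 1, 0, 1, 0, 0, 1], name="Survived")

counts = survived.value_counts()                # absolute numbers per class
shares = survived.value_counts(normalize=True)  # proportions that sum to 1

majority_class = counts.idxmax()  # the most frequent label
baseline_accuracy = shares.max()  # accuracy of always predicting that label

print(majority_class, baseline_accuracy)
```

On the data it was computed from, always predicting the majority class is right exactly as often as that class occurs, which gives a baseline that later models should beat.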

To dive in a little deeper, you can perform similar count and percentage calculations on subsets of the Survived column. For example, could gender play a role as well? You can explore this using the .value_counts() method for a two-way comparison of the number of males and females that survived, with this syntax:

train["Survived"][train["Sex"] == 'male'].value_counts()
train["Survived"][train["Sex"] == 'female'].value_counts()
To get proportions, you can again pass in the argument normalize = True to the .value_counts() method.

In [10]:

# Passengers that survived vs passengers that passed away
print("Survived passengers vs passengers passed away:\n" , train["Survived"].value_counts())

# As proportions
print("Survived passengers vs passengers passed away:\n" , train["Survived"].value_counts(normalize=True).round(2))

# Males that survived vs males that passed away
print("Males survived: \n", train["Survived"][train["Sex"] == 'male'].value_counts())

# Females that survived vs Females that passed away
print("Females survived: \n",train["Survived"][train["Sex"] == 'female'].value_counts())

# Normalized male survival
print("Males survived: \n", train["Survived"][train["Sex"] == 'male'].value_counts(normalize=True))

# Normalized female survival
print("Females survived: \n",train["Survived"][train["Sex"] == 'female'].value_counts(normalize=True))
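
As an alternative sketch, the same two-way comparison can be produced in one table with pd.crosstab, or as a single survival rate per group with groupby. The data below is made up; this is one possible way to do it, not the approach the exercise asks for.

```python
import pandas as pd

# Made-up passengers, not the Titanic training set.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# Rows are Sex, columns are Survived; normalize="index" makes each row sum to 1.
by_sex = pd.crosstab(toy["Sex"], toy["Survived"], normalize="index")

# Because Survived is 0/1, its mean per group is the survival rate.
rate_by_sex = toy.groupby("Sex")["Survived"].mean()

print(by_sex)
print(rate_by_sex)
```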


# Does age play a role?
Another variable that could influence survival is age, since children were probably saved first. You can test this by creating a new column with a categorical variable Child. Child will take the value 1 where age is less than 18, and the value 0 where age is greater than or equal to 18.

To add this new variable you need to do two things (i) create a new column, and (ii) provide the values for each observation (i.e., row) based on the age of the passenger.

Adding a new column with Pandas in Python is easy and can be done via the following syntax:

your_data["new_var"] = 0
This code would create a new column in the your_data DataFrame titled new_var, with 0 for each observation.

To set the values based on the age of the passenger, you use a boolean test as the row selector of the .loc indexer. With .loc you select a subset of rows and a column in one step and assign a value to that subset of observations. For example,

train.loc[train["Fare"] > 10, "new_var"] = 1
would give a value of 1 to the variable new_var for the subset of passengers whose fare is greater than 10. Remember that new_var keeps a value of 0 for all other rows (including rows with a missing fare). The chained form train["new_var"][train["Fare"] > 10] = 1 is best avoided: pandas warns about it (SettingWithCopyWarning), and it may assign to a temporary copy instead of train.
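
A minimal sketch of this kind of conditional assignment on made-up fares is shown below. It uses the .loc form, which pandas recommends for assigning to a subset; the DataFrame `toy` and its values are assumptions.

```python
import numpy as np
import pandas as pd

# Made-up fares; one value is missing and one is exactly 10.
toy = pd.DataFrame({"Fare": [7.25, 71.28, np.nan, 10.0]})

# (i) Create the new column with a default of 0 for every row.
toy["new_var"] = 0

# (ii) Build a boolean mask; a missing fare compares as False.
mask = toy["Fare"] > 10

# Assign 1 only to the rows where the mask is True.
toy.loc[mask, "new_var"] = 1

print(toy)
```

Only the row with a fare of 71.28 is set to 1; the missing fare and the fare of exactly 10 keep the default of 0.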

A new column called Child in the train data frame has been created for you that takes the value NaN for all observations.

INSTRUCTIONS
Set the values of Child to 1 if the passenger's age is less than 18 years.
Then assign the value 0 to observations where the passenger's age is greater than or equal to 18 years in the new Child column.
Compare the normalized survival rates for those who are under 18 and those who are older. Use code similar to what you had in the previous exercise.

In [11]:
# Create the column Child and fill it with NaN
train["Child"] = float('NaN')

# Assign 1 to passengers under 18, 0 to those 18 or older. Print the new column.
# .loc avoids chained assignment, which may write to a temporary copy.
train.loc[train["Age"] < 18, "Child"] = 1
train.loc[train["Age"] >= 18, "Child"] = 0
print(train["Child"])

# Print normalized Survival Rates for passengers under 18
print(train["Survived"][train["Child"] == 1].value_counts(normalize = True))

# Print normalized Survival Rates for passengers 18 or older
print(train["Survived"][train["Child"] == 0].value_counts(normalize = True))
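
Passengers with a missing age satisfy neither condition, so their Child value stays NaN. The sketch below, on made-up data, builds the same kind of column in one step while keeping NaN for unknown ages, then compares survival rates per group. It is an alternative formulation under those assumptions, not the exercise's required solution.

```python
import numpy as np
import pandas as pd

# Made-up passengers; one age is missing.
toy = pd.DataFrame({
    "Age": [5.0, 30.0, np.nan, 17.0, 18.0, 40.0],
    "Survived": [1, 0, 1, 0, 0, 1],
})

# 1.0 for under 18, 0.0 otherwise (a missing age compares as False here)...
is_child = (toy["Age"] < 18).astype(float)

# ...then restore NaN wherever the age itself is unknown.
toy["Child"] = is_child.where(toy["Age"].notna())

# groupby drops the NaN group by default; the mean of 0/1 is the survival rate.
rates = toy.groupby("Child")["Survived"].mean()

print(toy)
print(rates)
```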

# First Prediction
In one of the previous exercises you discovered that in your training set, females had over a 50% chance of surviving and males had less than a 50% chance of surviving. Hence, you could use this information for your first prediction: all females in the test set survive and all males in the test set die.

You use your test set to validate your predictions. You might have noticed that, unlike the training set, the test set has no Survived column. You add such a column using your predicted values. Then, when you upload your results, Kaggle will use this variable (your predictions) to score your performance.

Create a variable test_one as a copy of the dataset test.
Add an additional column, Survived, that you initialize to zero.
Use vector subsetting like in the previous exercise to set the value of Survived to 1 for observations whose Sex equals "female".
Print the Survived column of predictions from the test_one dataset.

In [12]:
# Create a copy of test: test_one
# (.copy() makes an independent DataFrame; plain assignment would only alias test)
test_one = test.copy()

# Initialize a Survived column to 0
test_one['Survived'] = 0

# Set Survived to 1 if Sex equals "female" and print the `Survived` column from `test_one`
test_one.loc[test_one['Sex'] == 'female', 'Survived'] = 1
print(test_one['Survived'])
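
The difference between copying a DataFrame and merely giving it a second name matters here, because the prediction column should not leak into the original test data. A minimal sketch on made-up rows (the names `toy_test`, `alias` and `independent` are hypothetical):

```python
import pandas as pd

# Made-up stand-in for the test set.
toy_test = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Sex": ["male", "female", "male"],
})

alias = toy_test              # a second name for the same object
independent = toy_test.copy() # a separate DataFrame with the same contents

# Gender-based prediction written to the copy only.
independent["Survived"] = 0
independent.loc[independent["Sex"] == "female", "Survived"] = 1

print(alias is toy_test)               # the alias is the original object
print("Survived" in toy_test.columns)  # the original has no new column
print(independent["Survived"].tolist())
```

Had the column been added through `alias` instead, it would have appeared in `toy_test` as well.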