Skip to content

sinderpreet31/Titanic-Competition-Kaggle

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 

Repository files navigation

This Python code is designed to perform data analysis and machine learning on the Titanic dataset, a popular dataset for beginners in data science. The dataset includes passenger information from the Titanic, such as age, sex, passenger class, and whether they survived the sinking. The goal is to predict survival based on these features. Here's a step-by-step explanation of the code:

  1. Import Libraries: First, the code imports necessary Python libraries. numpy is used for linear algebra operations, and pandas for data processing and CSV file I/O (input/output).

  2. List Dataset Files: It lists all files in the (https://www.kaggle.com/competitions/titanic) directory, which is a common structure in Kaggle's environment. The dataset includes files such as train.csv, test.csv, and gender_submission.csv.

  3. Load the Dataset: The training and test datasets are loaded into pandas DataFrames using pd.read_csv. The training dataset (train.csv) includes labels (survival information), while the test dataset (test.csv) does not.

  4. Data Inspection:

    • Using train.shape and test.shape, the code prints the dimensions of the datasets to understand their size.
    • train.head() displays the first few rows of the training dataset for a preliminary inspection of the data and columns.
    • train.info() and test.info() provide summaries of the datasets, including the number of non-null entries in each column and the data type of each column.
  5. Data Cleaning:

    • The Cabin column, which has a significant amount of missing data, is dropped from both datasets with drop(columns=['Cabin'], inplace=True).
    • Missing values in the Embarked column of the training dataset are filled with 'S' using fillna, as 'S' is the most common embarkation point. The warning message suggests using a different method to avoid future deprecation issues.
    • Missing values in the Fare column of the test dataset are filled with the column's mean using fillna(test['Fare'].mean()).
    • The Age column's missing values are filled with randomly generated ages within one standard deviation from the mean to maintain the column's original distribution.
  6. Exploratory Data Analysis (EDA):

    • Group by operations are used to analyze survival rates based on Pclass, Sex, and Embarked.
    • Distribution plots (using seaborn's deprecated distplot) are created for age and fare to visualize the differences between survivors and non-survivors.
  7. Feature Engineering:

    • A new column family is created by adding SibSp (number of siblings/spouses aboard) and Parch (number of parents/children aboard), plus one (for the passenger themselves), to calculate the size of each passenger's family.
    • A function cal is defined to categorize the family size into 'Alone', 'Medium', or 'Large', which is then applied to both datasets to create a family_size column.
    • Unnecessary columns such as SibSp, Parch, Name, and PassengerId are dropped to streamline the dataset.
  8. Preprocessing for Machine Learning:

    • Categorical variables (Pclass, Sex, Embarked, family_size) are converted into numerical format using one-hot encoding (pd.get_dummies), which is necessary for the machine learning algorithms to process them.
  9. Model Building and Prediction:

    • The datasets are split into features (X) and the target label (y) which is Survived for the training set.
    • The training data is further split into training and testing sets using train_test_split.
    • A decision tree classifier is trained on the training set and used to make predictions on the test set.
    • The accuracy of the model is evaluated using accuracy_score.
  10. Prepare Submission:

    • Predictions are made on the test dataset using the trained model.
    • A DataFrame final is created to store these predictions along with PassengerId, and saved to submission.csv, which can be submitted to Kaggle.

This code demonstrates a typical workflow for a Kaggle competition, from loading and inspecting the data, cleaning and preparing the data, to building a predictive model and preparing a submission file.

About

Titanic - Machine Learning from Disaster

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors