Titanic Dataset Analysis

Introduction

This project analyzes the Titanic dataset, a well-known dataset in data science and machine learning. The dataset provides information on the passengers aboard the Titanic, including their demographics, ticket class, and whether they survived the disaster.

Dataset

The dataset can be found on Kaggle. It consists of two CSV files:

  • train.csv: The training set, containing passenger details including the survival outcome.
  • test.csv: The test set, containing the same details but without the survival outcome.

Data Dictionary

The columns in the dataset are as follows:

  • PassengerId: Unique ID for each passenger.
  • Survived: Survival (0 = No, 1 = Yes).
  • Pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd).
  • Name: Name of the passenger.
  • Sex: Gender of the passenger.
  • Age: Age of the passenger.
  • SibSp: Number of siblings/spouses aboard the Titanic.
  • Parch: Number of parents/children aboard the Titanic.
  • Ticket: Ticket number.
  • Fare: Passenger fare.
  • Cabin: Cabin number.
  • Embarked: Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).
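
A quick way to verify the columns above is to load the training set with pandas (this assumes the CSV files sit in the data/ directory, as shown in the project structure below):

import pandas as pd

# Load the training set and compare it against the data dictionary
train = pd.read_csv("data/train.csv")
print(train.shape)          # (891, 12) for the Kaggle training set
print(train.dtypes)         # one entry per column listed above
print(train.isna().sum())   # Age, Cabin, and Embarked contain missing values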

Project Structure

The project is structured as follows:

titanic-analysis/
├── data/
│   ├── train.csv
│   ├── test.csv
├── notebooks/
│   ├── data_exploration.ipynb
│   ├── data_cleaning.ipynb
│   ├── feature_engineering.ipynb
│   ├── model_training.ipynb
│   ├── model_evaluation.ipynb
├── src/
│   ├── data_processing.py
│   ├── feature_engineering.py
│   ├── model.py
├── README.md
├── requirements.txt
└── .gitignore

Notebooks

  • data_exploration.ipynb: Initial exploration of the dataset to understand its structure and contents.
  • data_cleaning.ipynb: Data cleaning, including handling of missing values, outliers, and incorrect data types.
  • feature_engineering.ipynb: Creation of new features from existing ones to improve model performance (a minimal sketch of the cleaning and feature steps follows this list).
  • model_training.ipynb: Training various machine learning models on the cleaned and processed data.
  • model_evaluation.ipynb: Evaluation of the trained models using appropriate metrics.
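
The exact steps live in the notebooks; as a hedged illustration, one common approach on this dataset is to impute missing Age values with the median, fill missing Embarked values with the mode, and derive a family-size feature from SibSp and Parch:

import pandas as pd

train = pd.read_csv("data/train.csv")

# Impute missing values (one common choice; the notebooks may use a different strategy)
train["Age"] = train["Age"].fillna(train["Age"].median())
train["Embarked"] = train["Embarked"].fillna(train["Embarked"].mode()[0])

# Derive a simple engineered feature: family size including the passenger
train["FamilySize"] = train["SibSp"] + train["Parch"] + 1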

Scripts

  • data_processing.py: Contains functions for data loading, cleaning, and preprocessing.
  • feature_engineering.py: Contains functions for feature creation and transformation.
  • model.py: Contains functions for model building, training, and evaluation.
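
As a self-contained sketch of the kind of training and evaluation logic model.py covers (the actual function names in src/ may differ; scikit-learn is assumed to be listed in requirements.txt):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

train = pd.read_csv("data/train.csv")

# Minimal feature set for illustration; the real pipeline engineers more features
X = train[["Pclass", "SibSp", "Parch", "Fare"]].copy()
X["Sex"] = (train["Sex"] == "female").astype(int)
X["Age"] = train["Age"].fillna(train["Age"].median())
y = train["Survived"]

# Hold out 20% of the training data for validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print("Validation accuracy:", accuracy_score(y_val, model.predict(X_val)))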

Getting Started

Prerequisites

Ensure you have Python 3.6 or above installed. You can use the requirements.txt file to install the necessary dependencies.

pip install -r requirements.txt
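
Optionally, run the install inside a virtual environment to keep the dependencies isolated:

python -m venv venv
source venv/bin/activate   # on Windows: venv\Scripts\activate
pip install -r requirements.txt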

Running the Notebooks

You can start by running the notebooks in the following order:

  1. data_exploration.ipynb
  2. data_cleaning.ipynb
  3. feature_engineering.ipynb
  4. model_training.ipynb
  5. model_evaluation.ipynb
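
From the project root, start Jupyter and open each notebook from the notebooks/ directory in the order above (the jupyter package is assumed to be listed in requirements.txt):

jupyter notebook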

Contributing

If you wish to contribute to this project, please fork the repository and submit a pull request with a detailed description of your changes.

License

This project is licensed under the MIT License. See the LICENSE file for more details.

Acknowledgments

  • Kaggle for providing the dataset.
  • The data science community for their continuous support and inspiration.
