This project uses exploratory data analysis (EDA) to extract meaningful insights from a dataset. The analysis is performed in a Jupyter Notebook and combines data manipulation, visualization, and statistical exploration. We preprocess and clean the dataset so that it is ready for model training. The model trained on this dataset will be a multi-class classifier for four breast cancer types.
- Data Loading and Cleaning: Initial exploration of the dataset, handling missing values, and fixing data inconsistencies.
- Statistical Summary: Calculation of descriptive statistics (mean, median, standard deviation, etc.).
- Data Visualization: Generation of plots (e.g., histograms, scatter plots, box plots) for understanding data distributions and relationships.
- Correlation Analysis: Investigation of relationships between variables.
- Key Findings: Highlights and insights derived from the analysis.
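The cleaning, summary, and correlation steps listed above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the toy DataFrame, its column names (`age`, `tumor_size_mm`, `subtype`), and the choice of median imputation are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset;
# the column names are illustrative only.
df = pd.DataFrame({
    "age": [45, 52, np.nan, 61, 52, 38],
    "tumor_size_mm": [12.0, 25.5, 18.0, np.nan, 25.5, 9.5],
    "subtype": ["A", "B", "b ", "C", "B", "D"],
})


def clean(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy: consistent labels, no missing numerics, no duplicate rows."""
    out = frame.copy()
    # Fix inconsistent category labels (stray whitespace, mixed case).
    out["subtype"] = out["subtype"].str.strip().str.upper()
    # Fill missing numeric values with each column's median (an assumed strategy).
    numeric_cols = out.select_dtypes(include="number").columns
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].median())
    # Drop exact duplicate rows and renumber the index.
    return out.drop_duplicates().reset_index(drop=True)


cleaned = clean(df)

# Descriptive statistics (count, mean, std, quartiles) for numeric columns.
summary = cleaned.describe()

# Pairwise Pearson correlation between numeric columns.
corr = cleaned.select_dtypes(include="number").corr()

print(summary)
print(corr)
```

Median imputation is only one option; depending on how much data is missing, dropping rows or using a model-based imputer may be more appropriate.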
To run this notebook, ensure you have the following installed:
- Python 3.8+
- Jupyter Notebook or JupyterLab
- The following Python libraries:
  - pandas
  - numpy
  - matplotlib
  - seaborn
  - scipy (optional, if advanced statistical tests are included)
You can install the dependencies using:
`pip install pandas numpy matplotlib seaborn scipy`

- Clone the repository or download the notebook file.
- Open the project folder in Jupyter, VS Code, or any other IDE with notebook support.
- Open the EDA.ipynb file.
- Run each cell sequentially to perform the analysis.
- Name: data.csv
- Description: This dataset contains clinical and genomic data for patients with breast cancer. The dataset includes four types of breast cancer.
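Because the dataset feeds a four-class classifier, it can help to check how the classes are distributed right after loading. The sketch below assumes a label column named `cancer_type` and made-up class labels; the real column names in `data.csv` are not listed here, so adjust them to match the file.

```python
import io

import pandas as pd

# Inline stand-in for data.csv so the example runs on its own.
# "patient_id", "cancer_type", and the label values are hypothetical.
csv_text = """patient_id,cancer_type
1,type_1
2,type_2
3,type_3
4,type_4
5,type_1
6,type_2
7,type_1
8,type_3
"""

# In the notebook this would be: df = pd.read_csv("data.csv")
df = pd.read_csv(io.StringIO(csv_text))

# Absolute and relative frequency of each class.
counts = df["cancer_type"].value_counts()
shares = df["cancer_type"].value_counts(normalize=True)

print(counts)
print(shares.round(3))
```

A strongly imbalanced distribution would be worth noting before training, since it affects how the classifier should be evaluated.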
The results are documented in the notebook as inline comments and visualizations. For further analysis or reporting, you can export the figures and data summaries.
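Exporting a figure and a summary table could look like the sketch below. The column `feature_a`, the file names, and the temporary output directory are placeholders chosen for this example, not names used in the notebook.

```python
import tempfile
from pathlib import Path

import matplotlib

# Non-interactive backend so the script runs without a display;
# omit this line inside a notebook to keep inline plots.
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data; replace with the cleaned DataFrame from the notebook.
df = pd.DataFrame({"feature_a": [1.0, 2.5, 3.1, 4.8, 5.2]})

# Placeholder output directory; any writable folder works.
out_dir = Path(tempfile.mkdtemp())

# Save a histogram as a PNG file.
fig, ax = plt.subplots()
ax.hist(df["feature_a"], bins=5)
ax.set_xlabel("feature_a")
ax.set_ylabel("count")
fig_path = out_dir / "feature_a_hist.png"
fig.savefig(fig_path, dpi=150, bbox_inches="tight")
plt.close(fig)

# Save the descriptive statistics as a CSV file.
summary_path = out_dir / "summary.csv"
df.describe().to_csv(summary_path)

print(f"Figure written to {fig_path}")
print(f"Summary written to {summary_path}")
```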
For questions or feedback, please reach out to:
- Karim Khalil
- k.khalil@zeroandone.me
- https://github.com/KKhalil01