# **Data Collection Notebook**

## Objectives

* Fetch data from Kaggle and save it as raw data.
* Inspect the data and save it under outputs/datasets/collection.
* Push the files to the GitHub repository.

## Inputs

* Kaggle JSON file - the authentication token.
* The Breast Cancer Diagnosis dataset published by M Yasser H on Kaggle.

## Outputs

* The unzipped Kaggle files, saved to:
  - inputs/datasets/raw/
* The inspected dataset, saved to:
  - outputs/datasets/collection/breast-cancer.csv

## Additional Comments

* This notebook follows the guidelines provided in walkthrough project 2, 'Churnometer'.
* This notebook relates to the Data Understanding step of the CRISP-DM methodology.
* This notebook and the ones that follow represent the learning outcome of the Code Institute Predictive Analytics and Machine Learning module.


---

# Change working directory

* We assume the notebooks are stored in a subfolder, so when running a notebook in the editor you need to change the working directory.

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd().

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory.
* os.chdir() sets the new current directory.

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
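Re-running the `os.chdir` cell above moves one level further up each time it is executed. A minimal sketch of a guard against that is shown here; the helper name `move_to_parent_if_in` and the folder name `jupyter_notebooks` are assumptions for illustration and do not come from this notebook.

```python
import os


def move_to_parent_if_in(subfolder_name: str) -> str:
    """Change to the parent directory only when the current directory
    is named `subfolder_name`; return the resulting working directory."""
    current = os.getcwd()
    if os.path.basename(current) == subfolder_name:
        # Still inside the notebooks subfolder, so step up once
        os.chdir(os.path.dirname(current))
    return os.getcwd()
```

Calling `move_to_parent_if_in("jupyter_notebooks")` twice in a row leaves the working directory unchanged on the second call, provided the parent folder does not share the same name.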

# Fetch data from Kaggle

Install the Kaggle library with the command below:

In [None]:
! pip install kaggle

* A Kaggle account is required at this point: without one, Kaggle does not allow the data to be downloaded.
* Once an account has been created, the user can download a kaggle.json file.
* The kaggle.json file contains an authentication token, which is required to authenticate a data download from Kaggle.
* The kaggle.json token file must then be copied to the root directory of the project repository.
* Next, the user must set the Kaggle environment variable and restrict the token file's permissions to read and write for the owner only.

In [None]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
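The same setup can be sketched in pure Python, which also gives a clear error when the token file is missing. The helper name `prepare_kaggle_token` is hypothetical; `KAGGLE_CONFIG_DIR` and the `600` permission are the ones used in the cell above. The permission check assumes a POSIX system.

```python
import os
import stat


def prepare_kaggle_token(config_dir: str) -> str:
    """Point KAGGLE_CONFIG_DIR at `config_dir` and restrict kaggle.json
    to owner read/write. Returns the path of the token file."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"kaggle.json not found in {config_dir}")
    os.environ["KAGGLE_CONFIG_DIR"] = config_dir
    # stat.S_IRUSR | stat.S_IWUSR is the same mode as `chmod 600`
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return token_path
```

In this notebook the call would be `prepare_kaggle_token(os.getcwd())`, assuming the token has been copied to the project root as described above.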

We are using the following dataset - [Breast Cancer Diagnosis Dataset](https://www.kaggle.com/datasets/yasserh/breast-cancer-dataset)

Get the dataset path from the Kaggle URL: it is the part that follows `kaggle.com/datasets/`.

In [None]:
KaggleDatasetPath = "yasserh/breast-cancer-dataset"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

* The dataset is downloaded as a zip file, which must be unzipped before use. The cell below also deletes the zip file and the kaggle.json token once extraction succeeds.

In [None]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
    && rm {DestinationFolder}/*.zip \
    && rm kaggle.json
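On systems where the `unzip` command is not available, the extraction step can be sketched with the standard library instead. The helper name `unzip_and_remove` is an assumption for illustration; unlike the shell cell above, this sketch does not delete kaggle.json.

```python
import glob
import os
import zipfile


def unzip_and_remove(folder: str) -> list:
    """Extract every .zip archive found in `folder` into that folder,
    delete the archives, and return the names of the extracted members."""
    extracted = []
    for zip_path in glob.glob(os.path.join(folder, "*.zip")):
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(folder)
            extracted.extend(archive.namelist())
        # Remove the archive only after it has been extracted and closed
        os.remove(zip_path)
    return extracted
```

In this notebook the call would be `unzip_and_remove(DestinationFolder)`.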

---

# Load and Inspect Kaggle Data

In [None]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/breast-cancer.csv")
df.head(10)

In [None]:
df.info()
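Beyond `df.info()`, a quick count of missing values and duplicate rows is a common inspection step. The sketch below is one way to do it; the helper name `summarise_quality` and the toy values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd


def summarise_quality(frame: pd.DataFrame) -> dict:
    """Return basic data-quality counts for a DataFrame."""
    return {
        "rows": len(frame),
        "columns": frame.shape[1],
        "missing_values": int(frame.isna().sum().sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
    }


# Toy frame with one repeated row and one missing value (illustrative only)
toy = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "diagnosis": ["M", "B", "B", "M"],
    "radius_mean": [17.99, 13.54, 13.54, None],
})
summary = summarise_quality(toy)
```

Applied to the real data, the call would be `summarise_quality(df)`.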

**Abbreviations explained** (SE = standard error):

`id`
Unique ID


`diagnosis`
Target: M = Malignant, B = Benign


`radius_mean`
Mean Radius of Lobes


`texture_mean`
Mean of Surface Texture


`perimeter_mean`
Mean Outer Perimeter of Lobes


`area_mean`
Mean Area of Lobes


`smoothness_mean`
Mean of Smoothness Levels


`compactness_mean`
Mean of Compactness


`concavity_mean`
Mean of Concavity


`concave points_mean`
Mean of Concave Points


`symmetry_mean`
Mean of Symmetry


`fractal_dimension_mean`
Mean of Fractal Dimension


`radius_se`
SE of Radius


`texture_se`
SE of Texture


`perimeter_se`
SE of Perimeter


`area_se`
SE of Area


`smoothness_se`
SE of Smoothness


`compactness_se`
SE of compactness


`concavity_se`
SE of Concavity


`concave points_se`
SE of concave points


`symmetry_se`
SE of symmetry


`fractal_dimension_se`
SE of Fractal Dimension


`radius_worst`
Worst Radius


`texture_worst`
Worst Texture


`perimeter_worst`
Worst Perimeter


`area_worst`
Worst Area


`smoothness_worst`
Worst Smoothness


`compactness_worst`
Worst Compactness


`concavity_worst`
Worst Concavity


`concave points_worst`
Worst Concave Points


`symmetry_worst`
Worst Symmetry


`fractal_dimension_worst`
Worst Fractal Dimension

### Convert 'diagnosis' values

* We convert the `diagnosis` values from `M` and `B` to `1` and `0` respectively, so that the saved dataset already has a numeric target variable. ML models can then consume it directly, without extra preprocessing later.

In [None]:
df['diagnosis'].unique()

We will convert the diagnosis to binary values

* `M` = 1 (for Malignant)
* `B` = 0 (for Benign)

In [None]:
df['diagnosis'] = df['diagnosis'].map({'M': 1, 'B': 0})
df.tail(3)
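`Series.map` with a dictionary returns `NaN` for any value that is not a key, so an unexpected label would silently become a missing target. A minimal sketch of a guarded version follows; the helper name `encode_diagnosis` is hypothetical.

```python
import pandas as pd


def encode_diagnosis(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `frame` with diagnosis mapped M -> 1, B -> 0.
    Raises ValueError if any value falls outside the mapping."""
    encoded = frame.copy()
    encoded["diagnosis"] = encoded["diagnosis"].map({"M": 1, "B": 0})
    if encoded["diagnosis"].isna().any():
        raise ValueError("diagnosis contains values other than 'M' and 'B'")
    # Keep the target as an integer column
    encoded["diagnosis"] = encoded["diagnosis"].astype(int)
    return encoded
```

The function works on a copy, so it can be re-run safely; running the in-place cell above a second time would turn every value into `NaN`, because `1` and `0` are not keys of the mapping.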

---

# Push files to Repo

* Save the inspected dataset under outputs/datasets/collection so that it can be pushed to the repository.

In [None]:
import os
# Create outputs/datasets/collection; exist_ok avoids an error if the folder already exists
os.makedirs('outputs/datasets/collection', exist_ok=True)

In [None]:
df.to_csv("outputs/datasets/collection/breast-cancer.csv", index=False)