# **Data Collection Notebook**

## Objectives

* Fetch data from Kaggle and save it as raw data.
* Inspect the data and save it under outputs/datasets/collection.
* Push the files to the GitHub repository.

## Inputs

* Kaggle JSON file - the authentication token.
* The Student Performance Data Set by Data-Science Sean, downloaded from Kaggle.

## Outputs

* The Kaggle files are unzipped to:
    - inputs/datasets/raw/
* The inspected data is saved to:
    - outputs/datasets/collection/

## Additional Comments

* This notebook was written based on the guidelines provided in walkthrough project 2: 'Churnometer'.
* This notebook relates to the Data Understanding step of the CRISP-DM methodology.
* This notebook and those that follow represent the learning outcomes of the Code Institute Predictive Analytics and Machine Learning module.


---

# Change working directory

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd().

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory.
* os.chdir() sets the new current directory.

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
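
Re-running the `os.chdir` cell above moves up one more level each time it is executed. The sketch below shows one way to guard against that. The helper name `change_to_parent_once` and the folder name `jupyter_notebooks` are assumptions made for illustration only; substitute the folder that actually holds this notebook.

```python
import os


def change_to_parent_once(expected_folder_name):
    """Move up one level only if we are still inside `expected_folder_name`.

    Returns the working directory after the (possible) change.
    """
    cwd = os.getcwd()
    # Only climb when the current folder is the notebook folder,
    # so running this cell twice does not climb twice.
    if os.path.basename(cwd) == expected_folder_name:
        os.chdir(os.path.dirname(cwd))
    return os.getcwd()


# Hypothetical path, used only to show what os.path.dirname computes
example_dir = os.path.join("workspace", "project", "jupyter_notebooks")
print(os.path.dirname(example_dir))
```

Calling `change_to_parent_once("jupyter_notebooks")` from inside that folder changes to its parent; calling it again leaves the working directory unchanged.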

# Fetch data from Kaggle


Install the Kaggle library with the command below:

In [None]:
! pip install kaggle

* A Kaggle account is required at this point; without one, Kaggle will not allow the data to be downloaded.
    - Once an account has been created, the user can download a kaggle.json file.
* The kaggle.json file contains an authentication token, which is required to authenticate a data download from Kaggle.
* The kaggle.json token file must then be copied to the root directory of the project repository.
* Next, the user must set the Kaggle environment variable and set the permissions on the token file to read and write for the user only.

In [None]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
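
The same setup can be sketched in pure Python, which also makes it possible to fail early when the token file is missing. The function name `configure_kaggle_token` is hypothetical; the only things taken from the cell above are the `KAGGLE_CONFIG_DIR` variable, the `kaggle.json` file name, and the 600 permission mode. The permission check assumes a POSIX system.

```python
import os
import stat


def configure_kaggle_token(config_dir):
    """Point the Kaggle client at `config_dir` and restrict kaggle.json to its owner."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"kaggle.json not found in {config_dir}")
    os.environ["KAGGLE_CONFIG_DIR"] = config_dir
    # Same effect as `chmod 600 kaggle.json`: read and write for the owner only
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return token_path
```

With the token copied to the project root, `configure_kaggle_token(os.getcwd())` would do the same work as the cell above.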

We are using the following dataset - [Student Performance Data Set](https://www.kaggle.com/datasets/larsen0966/student-performance-data-set)

* Get the dataset path from the Kaggle URL.


In [None]:
KaggleDatasetPath = "larsen0966/student-performance-data-set"
DestinationFolder = "inputs/datasets/raw"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

* The dataset is downloaded as a zip file, which must be unzipped before it can be used.


In [None]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
    && rm {DestinationFolder}/*.zip \
    && rm kaggle.json
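
For environments where the `unzip` command is not available, the extraction step could be approximated with the standard library. This is a sketch rather than a replacement for the cell above: the helper name `extract_and_remove_archives` is hypothetical, and it does not delete kaggle.json.

```python
import glob
import os
import zipfile


def extract_and_remove_archives(folder):
    """Unzip every .zip file in `folder` into `folder`, then delete the archive.

    Returns the names of all extracted members.
    """
    extracted = []
    for archive_path in sorted(glob.glob(os.path.join(folder, "*.zip"))):
        with zipfile.ZipFile(archive_path) as archive:
            archive.extractall(folder)
            extracted.extend(archive.namelist())
        # Remove the archive only after it has been closed
        os.remove(archive_path)
    return extracted
```

Calling `extract_and_remove_archives("inputs/datasets/raw")` after the download leaves only the extracted files in the folder.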

---

# Load and Inspect Kaggle Data

In [None]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/student-por.csv")
df.head()

In [None]:
df.info()

* **Abbreviations explained:**

1. **school** - 
    student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira)

2. **sex** -
    student's sex (binary: 'F' - female or 'M' - male)

3. **age** -
    student's age (numeric: from 15 to 22)

4. **address** -
    student's home address type (binary: 'U' - urban or 'R' - rural)

5. **famsize** -
    family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3)

6. **Pstatus** -
    parent's cohabitation status (binary: 'T' - living together or 'A' - apart)

7. **Medu** -
    mother's education (numeric: 0 = none, 1 = primary education (4th grade), 2 = 5th to 9th grade, 3 = secondary education or 4 = higher education)

8. **Fedu** -
    father's education (numeric: 0 = none, 1 = primary education (4th grade), 2 = 5th to 9th grade, 3 = secondary education or 4 = higher education)

9. **Mjob** -
    mother's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')

10. **Fjob** -
    father's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')

11. **reason** -
    reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other')

12. **guardian** -
    student's guardian (nominal: 'mother', 'father' or 'other')

13. **traveltime** -
    home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)

14. **studytime** -
    weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)

15. **failures** -
    number of past class failures (numeric: n if 1<=n<3, else 4)

16. **schoolsup** -
    extra educational support (binary: yes or no)

17. **famsup** -
    family educational support (binary: yes or no)

18. **paid** -
    extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)

19. **activities** -
    extra-curricular activities (binary: yes or no)

20. **nursery** -
    attended nursery school (binary: yes or no)

21. **higher** -
    wants to take higher education (binary: yes or no)

22. **internet** -
    Internet access at home (binary: yes or no)

23. **romantic** -
    with a romantic relationship (binary: yes or no)

24. **famrel** -
    quality of family relationships (numeric: from 1 - very bad to 5 - excellent)

25. **freetime** -
    free time after school (numeric: from 1 - very low to 5 - very high)

26. **goout** -
    going out with friends (numeric: from 1 - very low to 5 - very high)

27. **Dalc** -
    workday alcohol consumption (numeric: from 1 - very low to 5 - very high)

28. **Walc** -
    weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)

29. **health** -
    current health status (numeric: from 1 - very bad to 5 - very good)

30. **absences** -
    number of school absences (numeric: from 0 to 93)

31. **G1** -
    first period grade (numeric: from 0 to 20)

32. **G2** -
    second period grade (numeric: from 0 to 20)

33. **G3** -
    final grade (numeric: from 0 to 20, output target)
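
The documented ranges above can double as a quick sanity check on the loaded data. The following is a minimal sketch, assuming pandas is installed. The dictionary `DOCUMENTED_RANGES` and the helper `find_out_of_range` are hypothetical names, only a handful of columns are included, and the small DataFrame is made-up illustration data rather than rows from the dataset.

```python
import pandas as pd

# Ranges copied from the list above; only a few columns are shown
DOCUMENTED_RANGES = {
    "age": (15, 22),
    "Medu": (0, 4),
    "Fedu": (0, 4),
    "G1": (0, 20),
    "G2": (0, 20),
    "G3": (0, 20),
}


def find_out_of_range(df, ranges=DOCUMENTED_RANGES):
    """Return {column: count of values outside the documented range}.

    Missing values count as out of range, because Series.between
    returns False for NaN.
    """
    problems = {}
    for column, (low, high) in ranges.items():
        if column not in df.columns:
            continue
        # between() is inclusive on both ends by default
        outside = ~df[column].between(low, high)
        if outside.any():
            problems[column] = int(outside.sum())
    return problems


# Made-up rows for illustration only
sample = pd.DataFrame({"age": [15, 22, 23], "G3": [0, 20, 10]})
print(find_out_of_range(sample))
```

Running `find_out_of_range(df)` on the loaded DataFrame would return an empty dictionary if every checked column matches its description.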

---

# Push files to Repo

In [None]:
import os
# create the outputs/datasets/collection folder; exist_ok avoids an error if it already exists
os.makedirs(name='outputs/datasets/collection', exist_ok=True)

df.to_csv("outputs/datasets/collection/student-por.csv", index=False)