## Optional Jupyter Notebooks – Task 1

Be sure to read the [Introduction to Notebooks](IntroductiontoNotebooks.ipynb).

**Run the cell below to load the framework for Task 1. Feel free to import any standard library you need here, per the Environment item above.**
You can safely ignore the "DeprecationWarning" about Pyarrow; it is only a warning, not an error.

In [5]:
import numpy as np
import pandas as pd
import unittest
import os
import sys
import warnings
warnings.filterwarnings("ignore")

task_path = os.path.abspath(os.path.join(os.getcwd(),"..","src"))
sys.path.append(task_path)
from task1 import *
utils_path = os.path.abspath(os.path.join(os.getcwd(),"..","tests"))
sys.path.append(utils_path)
from utils import *

## `find_data_type`
In this function you will take a dataset and the name of a column in it, and return the data type of that column.
##### Useful Resources
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dtypes.html

##### INPUTS
* `dataset` - a pandas DataFrame that contains some data
* `column_name` - a string containing the name of a column in `dataset`

##### OUTPUTS 
data type of the column (`np.dtype`)
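
As a minimal sketch of the lookup described above, the snippet below builds a small hypothetical DataFrame (`toy`, not part of the task data) and reads a column's type in two equivalent ways: through the Series attribute `dtype` and through the DataFrame attribute `dtypes`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    "count": np.array([1, 2, 3], dtype=np.int64),
    "price": [0.5, 1.5, 2.5],
})

count_dtype = toy["count"].dtype    # dtype of a single column (Series.dtype)
price_dtype = toy.dtypes["price"]   # same lookup through DataFrame.dtypes

print(count_dtype, price_dtype)
```

A dtype object compares equal to the matching NumPy scalar type, which is why the test cell below can compare the result against `np.int64` directly.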

In [8]:
# Write your code here 
def find_data_type(dataset:pd.DataFrame,column_name:str) -> np.dtype:
    return dataset[column_name].dtype

In [9]:
# Run this cell to test your code
dataset = pd.read_csv(os.path.join(os.getcwd(),"..","task1","sample.csv"))
#print(dataset)

if find_data_type(dataset, "target") == np.int64 and find_data_type(dataset, "color") == object and find_data_type(dataset, "version") == np.int64:
    print('Passed...')
else:
    print('Failed...')

Passed...


## `set_index_col`
In this function you will take a dataset and a series and set the index of the dataset to be the series

##### Useful Resources
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.html

##### INPUTS
* `dataset` - a pandas DataFrame that contains some data
* `index` - a pandas Series that contains an index for the dataset

##### OUTPUTS 
a pandas DataFrame indexed by the given index series
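
The sketch below illustrates, on a hypothetical toy DataFrame, how `DataFrame.set_index` behaves when it receives a Series rather than a column name. The names `toy` and `labels` are assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({"id": [10, 20, 30], "value": ["a", "b", "c"]})
labels = pd.Series([7, 8, 9])

# Passing a Series uses its values as the new row labels;
# all existing columns are kept.
indexed = toy.set_index(labels)

# Passing a column name moves that column out of the data and into the index.
by_name = toy.set_index("id")

print(indexed.index.tolist())
print(list(by_name.columns))
```

In both cases `set_index` returns a new DataFrame and leaves `toy` unchanged.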

In [10]:
# Write your code here
# Set column as index 
def set_index_col(dataset:pd.DataFrame,index:pd.Series) -> pd.DataFrame:
    return dataset.set_index(index)

In [11]:
# Run this cell to test your code
dataset = pd.read_csv(os.path.join(os.getcwd(),"..","task1","sample.csv"))
#print(dataset)
index=dataset.target.to_list()
your_answer = set_index_col(dataset,dataset['target'])
#print(your_answer)
correct_answer_list = dataset.target.tolist()
#print(correct_answer)

if(your_answer.index.to_list() == correct_answer_list):
    print('Passed...')
else:
    print('Failed...')    

Passed...


## `reset_index_col`
In this function you will take a dataset with an index already set and reindex the dataset from 0 to n-1, dropping the old index

##### Useful Resources
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html

##### INPUTS
* `dataset` - a pandas DataFrame that contains some data

##### OUTPUTS 
a pandas DataFrame indexed from 0 to n-1
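
To show what "dropping the old index" means in practice, here is a small sketch on hypothetical data comparing `reset_index(drop=True)` with the default `reset_index()`.

```python
import pandas as pd

# Hypothetical toy data with a non-default index
toy = pd.DataFrame({"value": [1.0, 2.0, 3.0]}, index=[10, 20, 30])

dropped = toy.reset_index(drop=True)  # old index is discarded
kept = toy.reset_index()              # old index becomes a column named "index"

print(dropped.index.tolist(), list(dropped.columns))
print(kept.index.tolist(), list(kept.columns))
```

With `drop=True` the result has the same columns as the input; without it, the old labels are preserved as an extra column.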

In [12]:
# Write your code here
# Reset column as index
def reset_index_col(dataset:pd.DataFrame) -> pd.DataFrame:
    return dataset.reset_index(drop=True)

In [13]:
# Run this cell to test your code
dataset = pd.read_csv(os.path.join(os.getcwd(),"..","task1","reset_index_input.csv"), index_col=0)
#print(dataset)
your_answer = reset_index_col(dataset)
#print(your_answer)
correct_answer = pd.read_pickle(os.path.join(os.getcwd(),"..","task1","pkl_files","reset_index_col.pkl"))
#print(correct_answer)

if compare_submission_to_answer_df(your_answer,correct_answer,"reset_index_col"):
    print('Passed...')
else:
    print('Failed...')


RUNNING TEST FOR reset_index_col

Passed...


## `set_col_type`
In this function you will be given a DataFrame, a column name, and a column type. You will convert the named column to the given type and return the updated DataFrame.

##### Useful Resources
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html

##### INPUTS
* `dataset` - a pandas DataFrame that contains some data
* `column_name` - a string containing the name of a column
* `new_col_type` - a type to change the column to

##### OUTPUTS 
a pandas DataFrame with the column in `column_name` changed to the type in `new_col_type`
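
The following sketch, using a hypothetical single-column DataFrame, shows two ways `astype` can change a column's type: on the Series itself, or on the whole DataFrame with a column-to-type mapping.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({"target": [0, 1, 1]})

# Convert one column; astype returns a new Series.
as_float = toy["target"].astype(np.float64)

# Convert through the DataFrame with a {column: type} mapping;
# this returns a new DataFrame and leaves `toy` as it was.
toy_converted = toy.astype({"target": np.float64})

print(as_float.dtype, toy_converted["target"].dtype, toy["target"].dtype)
```

Because `astype` does not modify its input, the converted column has to be assigned back (or the new DataFrame returned) for the change to persist.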

In [53]:
# Write your code here
# Set astype (string, int, datetime)
def set_col_type(dataset:pd.DataFrame,column_name:str,new_col_type:type) -> pd.DataFrame:
    dataset[column_name] = dataset[column_name].astype(new_col_type, copy=True)
    return dataset

In [56]:
# Run this cell to test your code
dataset = pd.read_csv(os.path.join(os.getcwd(),"..","task1","sample.csv"))
#print(dataset)
your_answer = set_col_type(dataset, 'target',np.float64)
#print(your_answer)
correct_answer = pd.read_pickle(os.path.join(os.getcwd(),"..","task1","pkl_files","set_index_col_target.pkl"))
#print(correct_answer)
if compare_submission_to_answer_df(your_answer,correct_answer,"set_col_type"):
    print('Passed...')
else:
    print('Failed...')


RUNNING TEST FOR set_col_type

Your Index doesnt match the Answer DF in set_col_type

target

color

version

cost

height

Error with 5 out of 5 columns

Failed...


## `make_DF_from_2d_array`
In this function you will take data in an array as well as column and row labels and use that information to create a pandas DataFrame 

##### Useful Resources
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html

##### INPUTS
* `array_2d` - a 2 dimensional numpy array of values
* `column_name_list` - a list of strings holding column names
* `index` - a pandas Series holding the row indices

##### OUTPUTS 
a pandas DataFrame with columns set from `column_name_list`, row index set from `index` and data set from `array_2d`
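
As an illustrative sketch (the array, column names, and index values are made up for the example), the `pd.DataFrame` constructor accepts the data, column labels, and row labels together:

```python
import numpy as np
import pandas as pd

# Hypothetical 3x2 array: [[0, 1], [2, 3], [4, 5]]
arr = np.arange(6).reshape(3, 2)
row_labels = pd.Series([0, 5, 10])

frame = pd.DataFrame(data=arr, columns=["A", "B"], index=row_labels)

# Rows are now addressed by the supplied labels, not by position.
print(frame.loc[5, "B"])
```

The values of the Series become the row labels, so `frame.loc[5]` refers to the second row of `arr`.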

In [27]:
# Write your code here
# Take Matrix of numbers and make it into a dataframe with column name and index numbering
def make_DF_from_2d_array(array_2d:np.ndarray,column_name_list:list[str],index:pd.Series) -> pd.DataFrame:
    return pd.DataFrame(data=array_2d, columns=column_name_list, index=index)

In [28]:
# Run this cell to test your code
array_2d = [[0,1,2],[3,4,5],[6,7,8],[9,10,11],[12,13,14],[15,16,17],[18,19,20],[21,22,23],[24,25,26],[27,28,29]]
column_names = ["A","B","C"]
index = pd.Series([0,5,10,15,20,25,30,35,40,45])
your_answer = make_DF_from_2d_array(array_2d,column_names,index)
#print(your_answer)
correct_answer = pd.read_pickle(os.path.join(os.getcwd(),"..","task1","pkl_files","make_DF_from_2d_array.pkl"))
#print(correct_answer)

if compare_submission_to_answer_df(your_answer,correct_answer,"make_DF_from_2d_array"):
    print('Passed...')
else:
    print('Failed...')


RUNNING TEST FOR make_DF_from_2d_array

Passed...


## `sort_DF_by_column`
In this function, you are given a dataset and column name. You will return a sorted dataset (sorting rows by the value of the specified column) either in descending or ascending order, depending on the value in the `descending` variable.

##### Useful Resources
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html

##### INPUTS
* `dataset` - a pandas DataFrame that contains some data
* `column_name` - a string that contains the column name to sort the data on
* `descending` - a boolean value (`True` or `False`) indicating whether the rows should be sorted in descending order

##### OUTPUTS 
a pandas DataFrame sorted by the given column name and in descending or ascending order depending on the value of the `descending` variable
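
`sort_values` takes an `ascending` flag, so a `descending` input has to be inverted before it is passed along. A minimal sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({"version": [2, 3, 1], "name": ["b", "c", "a"]})

descending = True
sorted_desc = toy.sort_values(by="version", ascending=not descending)

print(sorted_desc["version"].tolist())
print(sorted_desc.index.tolist())
```

The rows keep their original index labels after sorting; only their order changes.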

In [29]:
# Write your code here
# Sort Dataframe by values
def sort_DF_by_column(dataset:pd.DataFrame,column_name:str,descending:bool) -> pd.DataFrame:
    return dataset.sort_values(by=column_name, ascending=not descending)

In [30]:
# Run this cell to test your code
dataset = pd.read_csv(os.path.join(os.getcwd(),"..","task1","sample.csv"))
#print(dataset)
your_answer = sort_DF_by_column(dataset, "version", True)
#print(your_answer)
correct_answer = pd.read_pickle(os.path.join(os.getcwd(),"..","task1","pkl_files","sort_DF_by_column.pkl"))
#print(correct_answer)

if compare_submission_to_answer_df(your_answer,correct_answer,"sort_DF_by_column (version)"):
    print('Passed...')
else:
    print('Failed...')


RUNNING TEST FOR sort_DF_by_column (version)

Passed...


## `drop_NA_cols`
In this function you are given a DataFrame, and you will return a DataFrame with any columns containing `NA` values dropped.

##### Useful Resources
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html

##### INPUTS
* `dataset` - a pandas DataFrame that contains some data

##### OUTPUTS 
a pandas DataFrame with any columns that contain an `NA` value dropped
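
The sketch below uses a hypothetical three-column DataFrame to show how `axis=1` makes `dropna` operate on columns instead of rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: only column "b" contains a missing value
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [1.0, np.nan, 3.0],
    "c": ["x", "y", "z"],
})

without_na_cols = toy.dropna(axis=1, how="any")

print(list(without_na_cols.columns))
```

All rows are retained; only the column holding the missing value is removed.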

In [31]:
# Write your code here
# Drop NA values in dataframe Columns 
def drop_NA_cols(dataset:pd.DataFrame) -> pd.DataFrame:
    return dataset.dropna(axis=1, how='any')

In [32]:
# Run this cell to test your code
dataset = pd.read_csv(os.path.join(os.getcwd(),"..","task1","sample2.csv"))
#print(dataset)
your_answer = drop_NA_cols(dataset)
#print(your_answer)
correct_answer = pd.read_pickle(os.path.join(os.getcwd(),"..","task1","pkl_files","drop_NA_cols.pkl"))
#print(correct_answer)

if compare_submission_to_answer_df(your_answer,correct_answer,"drop_NA_cols"):
    print('Passed...')
else:
    print('Failed...')


RUNNING TEST FOR drop_NA_cols

Passed...


## `drop_NA_rows`
In this function you are given a DataFrame, and you will return a DataFrame with any rows containing `NA` values dropped.

##### Useful Resources
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html

##### INPUTS
* `dataset` - a pandas DataFrame that contains some data

##### OUTPUTS 
a pandas DataFrame with any rows that contain an `NA` value dropped
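
For rows, the `how` parameter decides whether a single missing value is enough to drop a row. The sketch below contrasts the default `how="any"` with `how="all"` on hypothetical data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 1 has one missing value, row 2 is entirely missing
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [4.0, 5.0, np.nan],
})

any_dropped = toy.dropna()           # default how="any": drop rows with at least one NA
all_dropped = toy.dropna(how="all")  # drop only rows where every value is NA

print(any_dropped.index.tolist())
print(all_dropped.index.tolist())
```

The task description ("any rows containing `NA` values") corresponds to the default behavior.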

In [33]:
# Write your code here
def drop_NA_rows(dataset:pd.DataFrame) -> pd.DataFrame:
    return dataset.dropna()

In [34]:
# Run this cell to test your code
dataset = pd.read_csv(os.path.join(os.getcwd(),"..","task1","sample2.csv"))
#print(dataset)
your_answer = drop_NA_rows(dataset)
#print(your_answer)
correct_answer = pd.read_pickle(os.path.join(os.getcwd(),"..","task1","pkl_files","drop_NA_rows.pkl"))
#print(correct_answer)

if compare_submission_to_answer_df(your_answer,correct_answer,"drop_NA_rows"):
    print('Passed...')
else:
    print('Failed...')


RUNNING TEST FOR drop_NA_rows

Passed...


## `make_new_column`
In this function you are given a dataset, a new column name, and a static value for the new column. Add the new column to the dataset and return the dataset.

##### Useful Resources
https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/05_add_columns.html

##### INPUTS
* `dataset` - a pandas DataFrame that contains some data
* `new_column_name` - a string containing the name of the new column to be created
* `new_column_value` - a string containing a static value that will be set for the new column for every row 

##### OUTPUTS 
a pandas DataFrame with the new column created named `new_column_name` and filled with the value in `new_column_value`
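
Column assignment accepts either a single value or a sequence. The sketch below, on a hypothetical DataFrame, shows both: a scalar is repeated for every row, while a list is assigned element by element and must match the number of rows.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({"x": [1, 2, 3]})

toy["flag"] = "new"          # scalar: broadcast to every row
toy["rank"] = [10, 20, 30]   # list: one value per row

print(toy["flag"].tolist())
print(toy["rank"].tolist())
```

Note that this form of assignment modifies `toy` in place rather than returning a copy.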

In [35]:
# Write your code here
def make_new_column(dataset:pd.DataFrame,new_column_name:str,new_column_value:str) -> pd.DataFrame:
    dataset[new_column_name] = new_column_value
    return dataset

In [36]:
# Run this cell to test your code
dataset = pd.read_csv(os.path.join(os.getcwd(),"..","task1","sample.csv"))
#print(dataset)
new_column_name = "New Column"
new_column_values = [12,13,14,15,16,17,18,19,20,21]
your_answer = make_new_column(dataset,new_column_name,new_column_values)
#print(your_answer)
correct_answer = pd.read_pickle(os.path.join(os.getcwd(),"..","task1","pkl_files","make_new_column.pkl"))
#print(correct_answer)

if compare_submission_to_answer_df(your_answer,correct_answer,"make_new_column"):
    print('Passed...')
else:
    print('Failed...')


RUNNING TEST FOR make_new_column

Passed...


## `left_merge_DFs_by_column`
In this function you are given two datasets and the name of a column on which to left join them (the left dataset is `left_dataset` and the right dataset is `right_dataset`), using the pandas `merge` method.

##### Useful Resources
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
https://stackoverflow.com/questions/53645882/pandas-merging-101

##### INPUTS
* `left_dataset` - a pandas DataFrame that contains some data
* `right_dataset` - a pandas DataFrame that contains some data
* `join_col_name` - a string containing the column name to join the two DataFrames on

##### OUTPUTS 
a pandas DataFrame containing the two datasets left joined together
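
To make the left-join semantics concrete, here is a small sketch with two hypothetical DataFrames in which one key of the left table has no match in the right table.

```python
import pandas as pd

# Hypothetical toy data: version 3 exists only in the left table
left = pd.DataFrame({"version": [1, 2, 3], "color": ["red", "blue", "green"]})
right = pd.DataFrame({"version": [1, 2], "cost": [9.5, 7.0]})

merged = left.merge(right, on="version", how="left")

print(len(merged))
print(merged["cost"].isna().tolist())
```

Every row of the left table survives the merge; where the right table has no matching key, its columns are filled with `NaN`.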

In [37]:
# Write your code here
def left_merge_DFs_by_column(left_dataset:pd.DataFrame,right_dataset:pd.DataFrame,join_col_name:str) -> pd.DataFrame:
    return left_dataset.merge(right_dataset, on=join_col_name, how='left')

In [38]:
# Run this cell to test your code
dataset = pd.read_csv(os.path.join(os.getcwd(),"..","task1","sample.csv"))
#print(dataset)
dataset2 = pd.read_csv(os.path.join(os.getcwd(),"..","task1","sample2.csv"))
#print(dataset2)
merge_col = "version"
your_answer = left_merge_DFs_by_column(dataset, dataset2, merge_col)
#print(your_answer)
correct_answer = pd.read_pickle(os.path.join(os.getcwd(),"..","task1","pkl_files","left_merge_DFs_by_column.pkl"))
#print(correct_answer)

if compare_submission_to_answer_df(your_answer,correct_answer,"left_merge_DFs_by_column"):
    print('Passed...')
else:
    print('Failed...')


RUNNING TEST FOR left_merge_DFs_by_column

Passed...


## `simpleClass`
This project will require you to work with Python classes. If you are not familiar with them, we suggest learning a bit more about them.

You will take the inputs into the Class initialization and set them as instance variables (of the same name) in the python class

##### Useful Resources
https://www.w3schools.com/python/python_classes.asp

##### INPUTS
* `length` - an integer
* `width` - an integer
* `height` - an integer

##### OUTPUTS 
None
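
For readers new to classes, the sketch below shows the general pattern of storing constructor arguments as instance variables. `Rectangle` and its `area` method are hypothetical and unrelated to the graded `simpleClass`.

```python
class Rectangle:
    """Hypothetical example class, used only to illustrate instance variables."""

    def __init__(self, length: int, width: int):
        # Store each constructor argument on the instance under the same name.
        self.length = length
        self.width = width

    def area(self) -> int:
        # Methods read the instance variables back through `self`.
        return self.length * self.width


rect = Rectangle(length=2, width=3)
print(rect.length, rect.width, rect.area())
```

Each object created from the class carries its own copy of these variables, so two `Rectangle` instances can hold different dimensions.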

In [39]:
# Write your code here
class simpleClass():
    # TODO: Read https://github.gatech.edu/pages/cs6035-tools/cs6035-tools.github.io/Projects/Machine_Learning/Task1.html and implement the function as described
    def __init__(self, length:int, width:int, height:int):
        self.length = length
        self.width = width
        self.height = height

In [40]:
# Run this cell to test your code

sc = simpleClass(length=1, width=2, height=3)

if sc.length == 1 and sc.width == 2 and sc.height == 3:
    print('Passed...')
else:
    print('Failed...')

Passed...


## `find_dataset_statistics`
Now that you have learned a bit about pandas DataFrames, you can start using them to generate some simple summary statistics for a DataFrame. You will be given the dataset as an input variable, as well as a column name for a column in the dataset that contains binary values (0 for negative and 1 for positive) that you will summarize.

##### Useful Resources
* https://www.learndatasci.com/glossary/binary-classification/
* https://developers.google.com/machine-learning/crash-course/framing/ml-terminology

##### INPUTS
* `dataset` - a pandas DataFrame that contains some data
* `label_col` - a string containing the name of the `label` column

##### OUTPUTS 
* `n_records` (int) - the number of rows in the dataset 
* `n_columns` (int) - the number of columns in the dataset
* `n_negative` (int) - the number of "negative" samples in the dataset **(`label` column equals 0)**
* `n_positive` (int) - the number of "positive" samples in the dataset **(`label` column equals 1)**
* `perc_positive` (float) - the percentage (out of 100%) of positive samples in the dataset 
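
The outputs listed above can be computed from the DataFrame's shape and a boolean mask on the label column. The sketch below works through the arithmetic on a hypothetical four-row dataset; the column names `feature` and `label` are assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy data: one negative and three positive samples
toy = pd.DataFrame({
    "feature": [0.1, 0.4, 0.3, 0.9],
    "label": [0, 1, 1, 1],
})

n_records, n_columns = toy.shape

# Comparing a column to a value gives a boolean Series; summing it counts the True entries.
n_negative = int((toy["label"] == 0).sum())
n_positive = int((toy["label"] == 1).sum())

perc_positive = 100.0 * n_positive / n_records

print(n_records, n_columns, n_negative, n_positive, perc_positive)
```

Here three of the four rows are positive, so the percentage works out to 75.0.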

In [48]:
# Write your code here
def find_dataset_statistics(dataset:pd.DataFrame,label_col:str) -> tuple[int,int,int,int,float]:
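    n_records = len(dataset)
    n_columns = len(dataset.columns)
    # .sum() on a boolean mask returns a NumPy integer; cast to match the annotated int return types
    n_negative = int((dataset[label_col] == 0).sum())
    n_positive = int((dataset[label_col] == 1).sum())
    perc_positive = 100 * (n_positive / n_records)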

    return n_records,n_columns,n_negative,n_positive,perc_positive

In [49]:
# Run this cell to test your code
dataset = pd.read_csv(os.path.join(os.getcwd(),"..","task1","sample.csv"))
target_col = "target"
n_records,n_columns,n_negative,n_positive,perc_positive = find_dataset_statistics(dataset,target_col)

if n_records == 10 and n_columns == 5 and n_negative == 5 and n_positive == 5 and perc_positive == 50:
    print('Passed...')
else:
    print('Failed...')

Passed...


# You have successfully reached the end of this notebook.