
The authors of [Introduction to Statistical Learning, 2e](https://www.statlearning.com/) have provided multiple datasets for practical labs and exercises. You can find the R package containing all of these datasets [here](https://cran.r-project.org/web/packages/ISLR2/index.html).

The __Default__ dataset contains credit card debt information for 10,000 consumers. I've made a few minor updates to the dataset:
* The binary categorical columns __student__ and __default__ contain the values __Yes__ and __No__. I've mapped these values to __1__ and __0__, respectively.
* I've removed the first column, which is just a range index.

Why did I make these changes?

I plan to use this dataset in a few blog posts at [Proclus Academy](https://proclusacademy.com). The blog covers basic Machine Learning concepts, and I don't want to include preprocessing steps in the posts because they take the focus away from the main topic. Hence, I'm getting the dataset into a shape I can use directly in the blog.

In [1]:
import pandas as pd

In [2]:
dataset = pd.read_csv('Default_original.csv')
dataset

,Unnamed: 0,default,student,balance,income
0,1,No,No,729.526495,44361.625074
1,2,No,Yes,817.180407,12106.134700
2,3,No,No,1073.549164,31767.138947
3,4,No,No,529.250605,35704.493935
4,5,No,No,785.655883,38463.495879
...,...,...,...,...,...
9995,9996,No,No,711.555020,52992.378914
9996,9997,No,No,757.962918,19660.721768
9997,9998,No,No,845.411989,58636.156984
9998,9999,No,No,1569.009053,36669.112365


### Remove the index column

In [3]:
# Drop the first column - that's just a range index
dataset.drop(dataset.columns[0], axis='columns', inplace=True)
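
As an alternative to dropping the column after loading, the range index can be skipped at read time. This is a minimal sketch, assuming the first CSV column is the unnamed range index seen in the output above. The inline `sample_csv` string is a stand-in for `Default_original.csv` built from the first rows shown earlier, and its exact header layout is an assumption.

```python
import io
import pandas as pd

# Stand-in for Default_original.csv: the first rows shown above, with the
# unnamed range-index column first (an assumption about the file layout).
sample_csv = io.StringIO(
    ",default,student,balance,income\n"
    "1,No,No,729.526495,44361.625074\n"
    "2,No,Yes,817.180407,12106.134700\n"
    "3,No,No,1073.549164,31767.138947\n"
)

# index_col=0 treats the first column as the index instead of data;
# reset_index(drop=True) then discards it in favour of a fresh 0-based index.
sample = pd.read_csv(sample_csv, index_col=0).reset_index(drop=True)
print(sample.columns.tolist())
```

Either approach should leave the same four columns; dropping the column explicitly, as done in this notebook, keeps the step visible.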

### Update the `default` column

In [4]:
dataset['default'].value_counts()

No     9667
Yes     333
Name: default, dtype: int64

In [5]:
dataset['default'] = dataset['default'].map({'Yes': 1, 'No': 0})

### Update the `student` column

In [6]:
dataset['student'].value_counts()

No     7056
Yes    2944
Name: student, dtype: int64

In [7]:
dataset['student'] = dataset['student'].map({'Yes': 1, 'No': 0})
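
One thing to watch with `Series.map` and a dictionary: any value missing from the dictionary becomes `NaN` without an error. The `value_counts()` checks above show that only `Yes` and `No` occur here, so the mapping is safe for this dataset. For other data, a guard along the following lines could help. The helper name `map_yes_no` is hypothetical, and the toy column is illustrative rather than taken from the dataset.

```python
import pandas as pd

def map_yes_no(column: pd.Series) -> pd.Series:
    """Map 'Yes'/'No' to 1/0, failing loudly on any other value."""
    mapped = column.map({'Yes': 1, 'No': 0})
    # Series.map leaves NaN wherever a value is missing from the dict,
    # which would also silently turn the column into floats.
    if mapped.isna().any():
        unexpected = column[mapped.isna()].unique().tolist()
        raise ValueError(f"Unexpected values: {unexpected}")
    return mapped.astype('int64')

# Toy column for illustration, not rows from the dataset
toy = pd.Series(['No', 'Yes', 'No'])
print(map_yes_no(toy).tolist())  # [0, 1, 0]
```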

In [8]:
dataset

,default,student,balance,income
0,0,0,729.526495,44361.625074
1,0,1,817.180407,12106.134700
2,0,0,1073.549164,31767.138947
3,0,0,529.250605,35704.493935
4,0,0,785.655883,38463.495879
...,...,...,...,...
9995,0,0,711.555020,52992.378914
9996,0,0,757.962918,19660.721768
9997,0,0,845.411989,58636.156984
9998,0,0,1569.009053,36669.112365


In [9]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   default  10000 non-null  int64  
 1   student  10000 non-null  int64  
 2   balance  10000 non-null  float64
 3   income   10000 non-null  float64
dtypes: float64(2), int64(2)
memory usage: 312.6 KB


### Save the updated dataset

In [10]:
dataset.to_csv('Default.csv', index=False)

In [11]:
pd.read_csv('Default.csv')

,default,student,balance,income
0,0,0,729.526495,44361.625074
1,0,1,817.180407,12106.134700
2,0,0,1073.549164,31767.138947
3,0,0,529.250605,35704.493935
4,0,0,785.655883,38463.495879
...,...,...,...,...
9995,0,0,711.555020,52992.378914
9996,0,0,757.962918,19660.721768
9997,0,0,845.411989,58636.156984
9998,0,0,1569.009053,36669.112365
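
Instead of comparing the reloaded output by eye, the round trip can also be checked programmatically. This is a sketch under assumptions: the small `sample` frame reuses the first rows of the cleaned output above in place of the full dataset, and it writes to a temporary directory rather than to `Default.csv` in the working directory.

```python
import os
import tempfile
import pandas as pd

# First rows of the cleaned dataset shown above, as a stand-in for the full frame
sample = pd.DataFrame({
    'default': [0, 0, 0],
    'student': [0, 1, 0],
    'balance': [729.526495, 817.180407, 1073.549164],
    'income': [44361.625074, 12106.134700, 31767.138947],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'Default.csv')
    sample.to_csv(path, index=False)  # index=False avoids writing a new index column
    reloaded = pd.read_csv(path)

# Raises AssertionError if columns, dtypes, or values differ
# (floats are compared within a small tolerance by default).
pd.testing.assert_frame_equal(sample, reloaded)
```

In the notebook itself, the same check would be `pd.testing.assert_frame_equal(dataset, pd.read_csv('Default.csv'))`.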
