# Case Study - 1 


## Project Introduction: Big Data Real-Time Analytics with Python and Spark 

This project showcases the practical application of Python and Spark in managing and analysing large-scale data, emphasising the importance of data preparation for successful analytics.


Dataset Description
1. Lending Club Loans: A dataset of thousands of loans made through the Lending Club platform, highlighting borrower risk and loan conditions.
2. Exchange Rates: Historical exchange rates between the US Dollar and the Euro from Yahoo Finance.

Objectives
* Data Cleaning: Preparing the datasets by handling missing values and inconsistencies.
* Real-Time Analytics: Using Python and Spark for real-time data processing and analysis.
* Insight Generation: Extracting meaningful insights for decision-making.
  
This project was developed under the guidance of the Data Science Academy.

References
- [Lending Club Dataset](https://www.openintro.org/data/index.php?data=loans_full_schema)
- [Yahoo Finance Exchange Rate Data](https://finance.yahoo.com/)

In [1]:
!pip install -q -U watermark

In [2]:
import numpy as np

In [3]:
import warnings
warnings.filterwarnings('ignore')

In [4]:
%reload_ext watermark
%watermark -a "Zelly Irigon" --iversions

Author: Zelly Irigon

numpy: 1.26.4




[set_printoptions numpy doc](https://numpy.org/doc/stable/reference/generated/numpy.set_printoptions.html#numpy-set-printoptions)

In [5]:
# Numpy print configuration
np.set_printoptions(suppress = True, linewidth = 200, precision = 2)

## Loading the dataset
- [genfromtxt numpy doc](https://numpy.org/doc/stable/reference/generated/numpy.genfromtxt.html#numpy.genfromtxt)

In [6]:
data = np.genfromtxt("data_set/dataset1.csv",
                     delimiter = ';',
                     skip_header = 1,
                     autostrip = True, #remove spaces
                     encoding = 'cp1252')

In [7]:
type(data)

numpy.ndarray

In [8]:
data.shape

(10000, 14)

In [9]:
data.view()

array([[48010226.  ,         nan,    35000.  , ...,         nan,         nan,     9452.96],
       [57693261.  ,         nan,    30000.  , ...,         nan,         nan,     4679.7 ],
       [59432726.  ,         nan,    15000.  , ...,         nan,         nan,     1969.83],
       ...,
       [50415990.  ,         nan,    10000.  , ...,         nan,         nan,     2185.64],
       [46154151.  ,         nan,         nan, ...,         nan,         nan,     3199.4 ],
       [66055249.  ,         nan,    10000.  , ...,         nan,         nan,      301.9 ]])

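Notice that several columns above contain only nan values. By default, np.genfromtxt parses every field as a float, so any value that cannot be converted to a number (text columns such as dates, statuses, or URLs) is loaded as nan, and so are fields that are empty. Next, we will address this.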
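
To see this behaviour in isolation, the sketch below loads a tiny in-memory CSV with the default float dtype. The three rows are made up for illustration and are not taken from dataset1.csv; only the `;` delimiter and the header skip mirror the cell above.

```python
import io
import numpy as np

# Hypothetical three-row sample that mimics the ';'-delimited layout of the dataset
sample_csv = "id;issue_d;loan_amnt\n1;May-15;35000\n2;;30000\n3;Sep-15;\n"

# With the default dtype (float), text fields and empty fields both become nan
toy = np.genfromtxt(io.StringIO(sample_csv), delimiter=';', skip_header=1)

print(toy)

# nan count per column: the text column is entirely nan,
# while the numeric column is nan only where the field was empty
nan_per_column = np.isnan(toy).sum(axis=0)
print(nan_per_column)
```

A column that is nan in every row is therefore a strong hint that it holds text, which is the idea used in the next cells.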

### Checking for missing values

In [10]:
np.isnan(data).sum()

88005

In [11]:
# Returns the highest value + 1 ignoring nan values.
# This arbitrary value will be used to fill missing values at the time of loading numeric variable data. 
# At a later stage, this value will be treated as a missing value.
arbitrary_value = np.nanmax(data) +1
arbitrary_value

68616520.0
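
The idea behind this arbitrary value can be sketched on a toy array. The numbers below are invented, and `np.where` is used here only as a stand-in for the `filling_values` argument of `np.genfromtxt`.

```python
import numpy as np

# Invented toy data with two missing entries
toy = np.array([[10.0, np.nan, 3.5],
                [np.nan, 7.0, 1.2]])

# Sentinel strictly greater than every real value in the array
sentinel = np.nanmax(toy) + 1

# Stand-in for filling_values at load time: replace nan with the sentinel
filled = np.where(np.isnan(toy), sentinel, toy)

# Later, the sentinel can be located again and treated as missing
missing_mask = filled == sentinel
print(sentinel)
print(missing_mask)
```

Because the sentinel is larger than any value that actually occurs, it cannot be confused with real data when it is looked up later.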

In [12]:
# Per-column mean ignoring nan values. Columns that are entirely nan (the string-type columns) return nan,
# which I will use to separate numeric variables from string-type variables.
mean_ignoring_nan = np.nanmean(data, axis = 0)
mean_ignoring_nan

array([54015809.19,         nan,    15273.46,         nan,    15311.04,         nan,       16.62,      440.92,         nan,         nan,         nan,         nan,         nan,     3143.85])

In [13]:
# Indices of the string-type columns (their mean is nan because no value could be parsed as a number)
string_columns = np.argwhere(np.isnan(mean_ignoring_nan)).squeeze()
string_columns

array([ 1,  3,  5,  8,  9, 10, 11, 12], dtype=int64)

In [14]:
# Indices of the numeric columns (their mean is a valid number)
numeric_columns = np.argwhere(~np.isnan(mean_ignoring_nan)).squeeze()
numeric_columns

array([ 0,  2,  4,  6,  7, 13], dtype=int64)
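
The same separation can be reproduced on a small, made-up array. This is a sketch rather than the notebook's exact code: it uses `np.flatnonzero` instead of `np.argwhere(...).squeeze()`, and it silences the RuntimeWarning that `np.nanmean` raises for all-nan columns locally instead of globally.

```python
import warnings
import numpy as np

# Invented data: column 1 is entirely nan, as a text column would be after loading
toy = np.array([[1.0, np.nan, 100.0],
                [2.0, np.nan, np.nan],
                [3.0, np.nan, 300.0]])

# np.nanmean warns about all-nan columns; ignore that warning only inside this block
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=RuntimeWarning)
    col_means = np.nanmean(toy, axis=0)

# A nan mean marks a column with no numeric value at all
string_like = np.flatnonzero(np.isnan(col_means))
numeric_like = np.flatnonzero(~np.isnan(col_means))

print(string_like, numeric_like)
```

`np.flatnonzero` always returns a 1-D array, even when only one column matches, whereas `squeeze()` would collapse a single match to a 0-d array.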

### The dataset is imported again, separating string-type columns from numeric columns

In [15]:
## Load the string-type columns
arr_strings = np.genfromtxt('data_set/dataset1.csv',
                            delimiter = ';',
                            skip_header = 1,
                            autostrip = True,
                            usecols = string_columns,
                            dtype = str,
                            encoding = 'cp1252')

In [16]:
arr_strings

array([['May-15', 'Current', '36 months', ..., 'Verified', 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=48010226', 'CA'],
       ['', 'Current', '36 months', ..., 'Source Verified', 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=57693261', 'NY'],
       ['Sep-15', 'Current', '36 months', ..., 'Verified', 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=59432726', 'PA'],
       ...,
       ['Jun-15', 'Current', '36 months', ..., 'Source Verified', 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=50415990', 'CA'],
       ['Apr-15', 'Current', '36 months', ..., 'Source Verified', 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=46154151', 'OH'],
       ['Dec-15', 'Current', '36 months', ..., '', 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=66055249', 'IL']], dtype='<U69')

In [29]:
## Load the numeric columns by filling in missing values
arr_numeric = np.genfromtxt("data_set/dataset1.csv",
                            delimiter = ';',
                            skip_header = 1,
                            autostrip = True,
                            usecols = numeric_columns,
                            filling_values = arbitrary_value,
                            encoding = 'cp1252')
arr_numeric

array([[48010226.  ,    35000.  ,    35000.  ,       13.33,     1184.86,     9452.96],
       [57693261.  ,    30000.  ,    30000.  , 68616520.  ,      938.57,     4679.7 ],
       [59432726.  ,    15000.  ,    15000.  , 68616520.  ,      494.86,     1969.83],
       ...,
       [50415990.  ,    10000.  ,    10000.  , 68616520.  , 68616520.  ,     2185.64],
       [46154151.  , 68616520.  ,    10000.  ,       16.55,      354.3 ,     3199.4 ],
       [66055249.  ,    10000.  ,    10000.  , 68616520.  ,      309.97,      301.9 ]])

### Extracting column names

In [35]:
## Load column names
arr_column_names = np.genfromtxt('data_set/dataset1.csv',
                                 delimiter = ';',
                                 autostrip = True,
                                 skip_footer = data.shape[0],
                                 dtype = str,
                                 encoding = 'cp1252')
arr_column_names

array(['id', 'issue_d', 'loan_amnt', 'loan_status', 'funded_amnt', 'term', 'int_rate', 'installment', 'grade', 'sub_grade', 'verification_status', 'url', 'addr_state', 'total_pymnt'], dtype='<U19')
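
As an alternative sketch, the header row can also be read with the `max_rows` argument of `np.genfromtxt`, which does not require knowing the number of data rows in advance. The in-memory sample below is hypothetical and stands in for dataset1.csv.

```python
import io
import numpy as np

# Hypothetical sample with the same ';' delimiter as the dataset
sample_csv = "id;issue_d;loan_amnt\n1;May-15;35000\n2;;30000\n"

# Read only the first line (the header) as strings
column_names = np.genfromtxt(io.StringIO(sample_csv),
                             delimiter=';',
                             autostrip=True,
                             max_rows=1,
                             dtype=str)
print(column_names)
```

Note that `max_rows` and `skip_footer` cannot be combined in the same call.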

In [32]:
## Separating numeric and string column headers 
header_strings, header_numeric = arr_column_names[string_columns], arr_column_names[numeric_columns]

In [33]:
header_strings

array(['issue_d', 'loan_status', 'term', 'grade', 'sub_grade', 'verification_status', 'url', 'addr_state'], dtype='<U19')

In [36]:
header_numeric

array(['id', 'loan_amnt', 'funded_amnt', 'int_rate', 'installment', 'total_pymnt'], dtype='<U19')

## Checkpoint Function

- Using a checkpoint function is not mandatory, but it is a good practice. This function will save the current state of my strings array. If I encounter any issues later in the data processing, I can revert to the file generated by this function and recover the information from that point.

In [37]:
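# Saves the header and data arrays to an .npz file and returns the reloaded file
def checkpoint(file_name, checkpoint_header, checkpoint_data):
    # np.savez appends the .npz extension to file_name automatically
    np.savez(file_name, header = checkpoint_header, data = checkpoint_data)
    checkpoint_variable = np.load(file_name + ".npz")
    return checkpoint_variable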

In [38]:
initial_checkpoint = checkpoint('data_set/initial_checkpoint', header_strings, arr_strings)

In [39]:
initial_checkpoint['data']

array([['May-15', 'Current', '36 months', ..., 'Verified', 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=48010226', 'CA'],
       ['', 'Current', '36 months', ..., 'Source Verified', 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=57693261', 'NY'],
       ['Sep-15', 'Current', '36 months', ..., 'Verified', 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=59432726', 'PA'],
       ...,
       ['Jun-15', 'Current', '36 months', ..., 'Source Verified', 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=50415990', 'CA'],
       ['Apr-15', 'Current', '36 months', ..., 'Source Verified', 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=46154151', 'OH'],
       ['Dec-15', 'Current', '36 months', ..., '', 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=66055249', 'IL']], dtype='<U69')

In [42]:
## Validating that the checkpoint file has the same content as arr_strings
np.array_equal(initial_checkpoint['data'], arr_strings)

True
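
A minimal sketch of how such a checkpoint could be used to recover from a mistake is shown below. The file name `demo_checkpoint`, the temporary directory, and the two-row array are assumptions made for this example so that it does not touch the data_set folder.

```python
import os
import tempfile
import numpy as np

# Invented header and data, standing in for header_strings and arr_strings
header = np.array(['issue_d', 'loan_status'])
rows = np.array([['May-15', 'Current'],
                 ['', 'Current']])

with tempfile.TemporaryDirectory() as tmp_dir:
    base = os.path.join(tmp_dir, 'demo_checkpoint')
    np.savez(base, header=header, data=rows)  # writes demo_checkpoint.npz

    # Simulate an accidental overwrite of the working array
    rows[:, 0] = 'lost'

    # Recover the saved state from disk; the with-block closes the file afterwards
    with np.load(base + '.npz') as saved:
        restored_header = saved['header']
        restored_rows = saved['data']

print(restored_rows)
```

The arrays read from the file are independent copies, so later changes to the working array do not affect them.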

## Manipulating the String-Type Columns

In [43]:
header_strings

array(['issue_d', 'loan_status', 'term', 'grade', 'sub_grade', 'verification_status', 'url', 'addr_state'], dtype='<U19')

In [44]:
## Adjusting the name of the 'issue_d' column to make it easier to identify
header_strings[0] = 'issue_date'

In [45]:
header_strings

array(['issue_date', 'loan_status', 'term', 'grade', 'sub_grade', 'verification_status', 'url', 'addr_state'], dtype='<U19')

In [46]:
arr_strings

array([['May-15', 'Current', '36 months', ..., 'Verified', 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=48010226', 'CA'],
       ['', 'Current', '36 months', ..., 'Source Verified', 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=57693261', 'NY'],
       ['Sep-15', 'Current', '36 months', ..., 'Verified', 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=59432726', 'PA'],
       ...,
       ['Jun-15', 'Current', '36 months', ..., 'Source Verified', 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=50415990', 'CA'],
       ['Apr-15', 'Current', '36 months', ..., 'Source Verified', 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=46154151', 'OH'],
       ['Dec-15', 'Current', '36 months', ..., '', 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=66055249', 'IL']], dtype='<U69')

### Pre-Processing issue_date variable using Label Encoding

In [47]:
## Extracting the unique values
np.unique(arr_strings[:,0])

array(['', 'Apr-15', 'Aug-15', 'Dec-15', 'Feb-15', 'Jan-15', 'Jul-15', 'Jun-15', 'Mar-15', 'May-15', 'Nov-15', 'Oct-15', 'Sep-15'], dtype='<U69')

In [48]:
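## Removing the '-15' year suffix so that only the month abbreviation remains
## Note: strip removes any of the characters '-', '1' and '5' from both ends (not the literal suffix), which is safe here because no month name contains them
arr_strings[:,0] = np.char.strip(arr_strings[:,0], '-15')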

In [49]:
## Extracting the unique values
np.unique(arr_strings[:,0])

array(['', 'Apr', 'Aug', 'Dec', 'Feb', 'Jan', 'Jul', 'Jun', 'Mar', 'May', 'Nov', 'Oct', 'Sep'], dtype='<U69')
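
One detail worth keeping in mind is that strip treats its argument as a set of characters rather than as a literal suffix. The sketch below, on made-up values, contrasts it with `np.char.replace`, which matches the exact substring; the value 'Oct-51' is hypothetical and does not occur in the dataset.

```python
import numpy as np

# 'Oct-51' is an invented value used only to show the difference
dates = np.array(['May-15', 'Oct-15', 'Oct-51'])

# strip removes any leading/trailing '-', '1' or '5' characters
stripped = np.char.strip(dates, '-15')

# replace removes only the exact substring '-15'
replaced = np.char.replace(dates, '-15', '')

print(stripped)
print(replaced)
```

For this dataset both approaches give the same months, since every non-empty value ends in '-15'.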

In [53]:
## Creating an array with the month abbreviations (the empty string at index 0 represents blank dates)
months = np.array(['','Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug','Sep','Oct','Nov','Dec'])

In [54]:
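## Loop to replace each month name with its position in the months array (Label Encoding)
## The column remains a string array, so the codes are stored as text ('0' to '12')
for i in range(len(months)):
    arr_strings[:,0] = np.where(arr_strings[:,0] == months[i], str(i), arr_strings[:,0])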

In [55]:
np.unique(arr_strings[:,0])

array(['0', '1', '10', '11', '12', '2', '3', '4', '5', '6', '7', '8', '9'], dtype='<U69')
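
The codes above are still strings, which is why they sort as '0', '1', '10', and so on. As a possible alternative, the sketch below encodes a small invented column in one step with broadcasting and returns integers directly. The `issue_month` values are made up, and this is not the approach used in the cells above.

```python
import numpy as np

months = np.array(['', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                   'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])

# Invented sample column, including one blank entry
issue_month = np.array(['May', '', 'Sep', 'Dec', 'Jan'])

# Compare every value against every month: boolean matrix of shape (5, 13)
matches = issue_month[:, None] == months

# The position of the single True in each row is the label code
codes = matches.argmax(axis=1)

# Guard: argmax returns 0 for rows with no match, so check that every value was found
all_matched = bool(matches.any(axis=1).all())

print(codes, all_matched)
```

Blank dates map to 0, exactly as in the loop, and the result can be used in numeric operations without a later type conversion.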