# Introduction

**Pandas** is a powerful, open-source Python library for data manipulation and analysis. The name is usually explained as a blend of two phrases: **"Panel Data,"** an econometrics and statistics term for multidimensional structured datasets, and "Python Data Analysis." It is also short, memorable, and friendly, much like the animal.

The name reflects the library's focus on working with structured data efficiently while staying approachable for users. Pandas provides high-performance, easy-to-use data structures and functions designed for structured data such as tables and time series.

**Key Features of Pandas:**

**1. Data Structures:**

*   Series: A one-dimensional labeled array capable of holding any data type (e.g., integers, strings, floats).
*   DataFrame: A two-dimensional labeled data structure, similar to a table in a database or an Excel spreadsheet.

**2. Data Manipulation:**

*   Easy handling of missing data.
*   Tools for reshaping, merging, and joining datasets.
*   Data filtering and subsetting.

**3. Data Analysis:**

*   Grouping and aggregations.
*   Statistical operations like mean, median, standard deviation, etc.
*   Data visualization through integration with libraries like Matplotlib and Seaborn.

**4. File I/O Operations:**
*   Reading and writing data to/from CSV, Excel, SQL databases, JSON, and other formats.

**5. Time Series Data:**

*   Built-in support for handling date and time data.
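To make the two core data structures concrete, here is a minimal sketch. The labels, values, and variable names (`incomes`, `people`, `mean_by_gender`) are made up for illustration and are not part of the loan dataset used later.

```python
import pandas as pd

# A Series: one-dimensional values with labels (the labels "a", "b", "c" are made up)
incomes = pd.Series([5849.0, 4583.0, 3000.0], index=["a", "b", "c"], name="ApplicantIncome")

# A DataFrame: two-dimensional, each column is a Series sharing one row index
people = pd.DataFrame(
    {
        "Gender": ["Male", "Female", "Male"],
        "ApplicantIncome": [5849.0, 4583.0, 3000.0],
    }
)

# Grouping and aggregation: mean income per gender
mean_by_gender = people.groupby("Gender")["ApplicantIncome"].mean()

print(incomes["b"])     # label-based access on a Series
print(people.shape)     # (rows, columns)
print(mean_by_gender)
```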

# **Section 1: Basic Operations**

Once you’ve loaded or created a dataset using Pandas, you can perform various operations to explore, manipulate, and analyze the data. Here's a quick overview of basic operations with examples:

In [None]:
!pip install pandas

If you encounter an error like **"ModuleNotFoundError: No module named 'pandas'"**, run the cell above to install the pandas library.

In [3]:
# Import libraries
import pandas as pd

In [None]:
# Read data from the csv file
dataset = pd.read_csv('loan_small.csv')

FileNotFoundError: [Errno 2] No such file or directory: 'loan_small.csv'

In [None]:
# Look at all the values
dataset

Unnamed: 0,Loan_ID,Gender,ApplicantIncome,CoapplicantIncome,LoanAmount,Area,Loan_Status
0,LP001002,,5849.0,0.0,,urban,Y
1,LP001003,Male,4583.0,,128.0,semi,N
2,LP001005,Male,3000.0,0.0,66.0,,Y
3,LP001006,Female,2583.0,2358.0,120.0,semi,
4,LP001008,Male,,0.0,141.0,urban,Y
5,LP001011,Male,5417.0,4196.0,267.0,semi,Y
6,LP001013,Male,2333.0,1516.0,,rural,Y
7,LP001014,Female,3036.0,2504.0,158.0,semi,N
8,LP001018,Male,4006.0,1526.0,168.0,rural,Y
9,LP001020,Male,12841.0,10968.0,349.0,semi,N


In [None]:
# Look at the first 5 rows
dataset.head()

Unnamed: 0,Loan_ID,Gender,ApplicantIncome,CoapplicantIncome,LoanAmount,Area,Loan_Status
0,LP001002,,5849.0,0.0,,urban,Y
1,LP001003,Male,4583.0,,128.0,semi,N
2,LP001005,Male,3000.0,0.0,66.0,,Y
3,LP001006,Female,2583.0,2358.0,120.0,semi,
4,LP001008,Male,,0.0,141.0,urban,Y


**dataset.head()** displays the first 5 rows of a DataFrame by default, allowing you to quickly preview the data structure, column names, and sample values.
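As a small sketch of how the row count can be changed, the example below uses a hypothetical ten-row frame named `toy` (not the loan dataset). `head` accepts the number of rows to return, and `tail` does the same from the end of the frame.

```python
import pandas as pd

# Hypothetical toy frame with ten rows, numbered 0..9
toy = pd.DataFrame({"value": range(10)})

first_five = toy.head()     # default: first 5 rows
first_two = toy.head(2)     # explicit row count
last_three = toy.tail(3)    # same idea, counted from the end

print(first_two)
print(last_three)
```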

In [None]:
# Access the data using iloc.
# Example - Get the first three rows of the second and third columns
subset = dataset.iloc[0:3, 1:3]

This code uses iloc to access specific rows and columns in a DataFrame by their integer-based index positions.

**Explanation:**
*   dataset.iloc[0:3, 1:3]: Selects rows 0 to 2 (the first three rows) and columns 1 to 2 (the second and third columns); the end of each slice is excluded.
*   subset: Stores the selected portion of the dataset.
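The end-exclusive behaviour of `iloc` is easier to see next to its label-based counterpart `loc`, whose slices include the end label. The sketch below uses a small hypothetical frame (`toy`, with shortened made-up IDs), not the loan dataset.

```python
import pandas as pd

# Hypothetical four-row frame with a default 0..3 row index
toy = pd.DataFrame(
    {
        "Loan_ID": ["LP1", "LP2", "LP3", "LP4"],
        "Gender": ["Male", "Female", "Male", "Female"],
        "ApplicantIncome": [5849.0, 4583.0, 3000.0, 2583.0],
    }
)

# iloc: integer positions, end-exclusive -> rows 0..2, columns 1..2
by_position = toy.iloc[0:3, 1:3]

# loc: labels, end-inclusive -> rows labeled 0..2, the two named columns
by_label = toy.loc[0:2, ["Gender", "ApplicantIncome"]]

# Both selections pick the same three rows and two columns
print(by_position.equals(by_label))
```

With a default integer index the two forms look alike, so the difference in slice endpoints (`0:3` versus `0:2`) is the detail to watch.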

In [None]:
subset

Unnamed: 0,Gender,ApplicantIncome
0,,5849.0
1,Male,4583.0
2,Male,3000.0


In [None]:
# Access the data using column names
# Get all rows of the columns Gender and ApplicantIncome
subsetN = dataset[['Gender', 'ApplicantIncome']]

This code accesses specific columns in a DataFrame using their names.

**Explanation:**


*   dataset[['Gender', 'ApplicantIncome']]: Selects all rows for the columns named "Gender" and "ApplicantIncome".
*   subsetN: Stores the resulting subset of the DataFrame containing only the specified columns.

In [None]:
subsetN

Unnamed: 0,Gender,ApplicantIncome
0,,5849.0
1,Male,4583.0
2,Male,3000.0
3,Female,2583.0
4,Male,
5,Male,5417.0
6,Male,2333.0
7,Female,3036.0
8,Male,4006.0
9,Male,12841.0


In [None]:
# Get first three rows of the columns Gender and ApplicantIncome
subsetN = dataset[['Gender', 'ApplicantIncome']][0:3]

In [None]:
subsetN

Unnamed: 0,Gender,ApplicantIncome
0,,5849.0
1,Male,4583.0
2,Male,3000.0


In [None]:
# Get the shape of the DataFrame (rows, columns)
dataset.shape

(16, 7)

In [None]:
# Get column names of the dataframe
dataset.columns

Index(['Loan_ID', 'Gender', 'ApplicantIncome', 'CoapplicantIncome',
       'LoanAmount', 'Area', 'Loan_Status'],
      dtype='object')

In [None]:
# Get the column names as a plain Python list
dataset.columns.to_list()

['Loan_ID',
 'Gender',
 'ApplicantIncome',
 'CoapplicantIncome',
 'LoanAmount',
 'Area',
 'Loan_Status']

# **Section 2: Handling missing values**

Handling missing values is a crucial step in data preprocessing to ensure data quality and avoid errors during analysis or modeling. Below are common methods to handle missing values in a dataset:

In [None]:
# Flag missing values cell by cell
dataset.isnull()
# Shows True wherever a value is missing; this is hard to scan by eye on larger tables

Unnamed: 0,Loan_ID,Gender,ApplicantIncome,CoapplicantIncome,LoanAmount,Area,Loan_Status
0,False,True,False,False,True,False,False
1,False,False,False,True,False,False,False
2,False,False,False,False,False,True,False
3,False,False,False,False,False,False,True
4,False,False,True,False,False,False,False
5,False,False,False,False,False,False,False
6,False,False,False,False,True,False,False
7,False,False,False,False,False,False,False
8,False,False,False,False,False,False,False
9,False,False,False,False,False,False,False


In [None]:
# Count the missing values in each column
dataset.isnull().sum(axis=0)

Loan_ID              0
Gender               1
ApplicantIncome      2
CoapplicantIncome    1
LoanAmount           3
Area                 1
Loan_Status          1
dtype: int64

In [None]:
# Drop rows that have missing values in particular columns, e.g. LoanAmount and Area
dataset_clean = dataset.dropna(subset=['LoanAmount', 'Area'])

In [None]:
dataset_clean.head()

Unnamed: 0,Loan_ID,Gender,ApplicantIncome,CoapplicantIncome,LoanAmount,Area,Loan_Status
1,LP001003,Male,4583.0,,128.0,semi,N
3,LP001006,Female,2583.0,2358.0,120.0,semi,
4,LP001008,Male,,0.0,141.0,urban,Y
5,LP001011,Male,5417.0,4196.0,267.0,semi,Y
7,LP001014,Female,3036.0,2504.0,158.0,semi,N


In [None]:
# Drop all the rows with missing values
dataset_clean = dataset.dropna()

In [None]:
dataset_clean.head()

Unnamed: 0,Loan_ID,Gender,ApplicantIncome,CoapplicantIncome,LoanAmount,Area,Loan_Status
5,LP001011,Male,5417.0,4196.0,267.0,semi,Y
7,LP001014,Female,3036.0,2504.0,158.0,semi,N
8,LP001018,Male,4006.0,1526.0,168.0,rural,Y
9,LP001020,Male,12841.0,10968.0,349.0,semi,N
10,LP001024,Female,3200.0,700.0,70.0,urban,Y


In [None]:
# Create a duplicate of the existing dataset
dt = dataset.copy()

In [None]:
dt.head()

Unnamed: 0,Loan_ID,Gender,ApplicantIncome,CoapplicantIncome,LoanAmount,Area,Loan_Status
0,LP001002,,5849.0,0.0,,urban,Y
1,LP001003,Male,4583.0,,128.0,semi,N
2,LP001005,Male,3000.0,0.0,66.0,,Y
3,LP001006,Female,2583.0,2358.0,120.0,semi,
4,LP001008,Male,,0.0,141.0,urban,Y


In [None]:
# Replace missing categorical values using column names

# Step 1: Create a list of categorical columns
cols = ['Gender', 'Area', 'Loan_Status']

# Step 2: Fill NaN values with each column's mode (i.e. its most frequent value)
dt[cols] = dt[cols].fillna(dt[cols].mode().iloc[0])

# Step 3: Check the count of missing values in the categorical columns
dt.isnull().sum(axis=0)

Loan_ID              0
Gender               0
ApplicantIncome      2
CoapplicantIncome    1
LoanAmount           3
Area                 0
Loan_Status          0
dtype: int64

In [None]:
# Replace missing numerical values using column names

# Step 1: Create a list of numerical columns
cols2 = ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount']

# Step 2: Fill NaN values with each column's mean (the median is a common alternative)
dt[cols2] = dt[cols2].fillna(dt[cols2].mean())

# Step 3: Check the count of missing values in the numerical columns
dt.isnull().sum(axis=0)

Loan_ID              0
Gender               0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Area                 0
Loan_Status          0
dtype: int64

# **Section 3: Encoding Techniques**

Encoding techniques convert categorical values into numerical values, which is essential because most machine learning algorithms require numerical input.

**Type 1: Label Encoding**

Label encoding transforms categorical data into numerical data by assigning a unique integer to each category. It is best suited to categorical features whose categories have a natural order or ranking. Here's a brief overview of how it works:

**How It Works:**

**Assign unique integers:** Each category is assigned a unique integer value.
Example:
*   "low_salary" → 0
*   "medium_salary" → 1
*   "high_salary" → 2

**Replace categorical values:** The original categories in the dataset are replaced by their corresponding integers.

**When to Use:**

Label encoding works well when:

*   The categories have a natural order (e.g., low, medium, high).
*   The encoding does not introduce unintended bias (since algorithms may interpret the numerical values as ordinal).
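The salary example above can be sketched in two ways. The Series `salary` and the names `order` and `cat_type` are illustrative assumptions. Note that a plain `astype('category')` sorts categories alphabetically, so the intended order has to be stated explicitly when it matters.

```python
import pandas as pd

# Hypothetical ordinal feature
salary = pd.Series(["medium_salary", "low_salary", "high_salary", "low_salary"])

# Option 1: an explicit mapping from category to integer
order = {"low_salary": 0, "medium_salary": 1, "high_salary": 2}
encoded_map = salary.map(order)

# Option 2: an ordered categorical dtype, then take the integer codes
cat_type = pd.CategoricalDtype(
    categories=["low_salary", "medium_salary", "high_salary"], ordered=True
)
encoded_cat = salary.astype(cat_type).cat.codes

# For comparison: default categories are sorted alphabetically (high < low < medium)
encoded_default = salary.astype("category").cat.codes

print(encoded_map.tolist())      # [1, 0, 2, 0]
print(encoded_cat.tolist())      # [1, 0, 2, 0]
print(encoded_default.tolist())  # [2, 1, 0, 1]
```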

***Here are some common algorithms that can use label encoding:***

*   **Decision Trees:** Algorithms like Decision Tree Classifier and Decision Tree Regressor can handle label-encoded data well.
*   **Random Forest:** Both Random Forest Classifier and Random Forest Regressor can work with label-encoded data.
*   **Gradient Boosting:** Algorithms like Gradient Boosting Machines (GBM), XGBoost, LightGBM, and CatBoost can use label-encoded data.
*   **Support Vector Machines (SVM):** SVM classifiers can work with label-encoded data.
*   **K-Nearest Neighbors (KNN):** KNN classifiers and regressors can use label-encoded data.
*   **Naive Bayes:** Naive Bayes classifiers can handle label-encoded data.
*   **Neural Networks:** Neural networks, including deep learning models, can use label-encoded data.

In [None]:
# Get datatypes of all the columns of the dataframe
dt.dtypes

Loan_ID               object
Gender                object
ApplicantIncome      float64
CoapplicantIncome    float64
LoanAmount           float64
Area                  object
Loan_Status           object
dtype: object

In [None]:
# Convert string/object column types to categorical
dt[cols] = dt[cols].astype('category')

In [None]:
dt[cols].dtypes

Gender         category
Area           category
Loan_Status    category
dtype: object

In [None]:
# Convert string to numerical codes
for columns in cols:
    dt[columns] = dt[columns].cat.codes

In [None]:
dt.dtypes    # Gender, Area and Loan_Status are now of type int8

Loan_ID               object
Gender                  int8
ApplicantIncome      float64
CoapplicantIncome    float64
LoanAmount           float64
Area                    int8
Loan_Status             int8
dtype: object

In [None]:
dt.head()   # Gender, Area and Loan_Status now hold integer codes (0, 1, 2, ...) instead of category names

Unnamed: 0,Loan_ID,Gender,ApplicantIncome,CoapplicantIncome,LoanAmount,Area,Loan_Status
0,LP001002,1,5849.0,0.0,140.923077,2,1
1,LP001003,1,4583.0,2509.333333,128.0,1,0
2,LP001005,1,3000.0,0.0,66.0,1,1
3,LP001006,0,2583.0,2358.0,120.0,1,1
4,LP001008,1,4103.571429,0.0,141.0,2,1


# **Type 2: One-Hot Encoding Techniques (Dummy variables)**

**One-hot encoding** is a method to convert categorical data into a format suitable for machine learning algorithms. It creates binary columns for each category, ensuring the categorical information is represented numerically without introducing an ordinal relationship.

**When to Use:**

One-hot encoding is used when there is no inherent order among the categories.
For example: "red", "green", and "blue" represent different colors without any ranking or order.

**Here's how it works:**

1.   **Create binary columns:** For each unique category value in a column, a new binary column (0 or 1) is created. Example:

     *   **Original column:** "red", "green", "blue"
     *   **Transformed columns:**
         *   is_red: 1 if the category is "red", else 0.
         *   is_green: 1 if the category is "green", else 0.
         *   is_blue: 1 if the category is "blue", else 0.

2.   **Replace the original column:** The original categorical column is replaced with the newly created binary columns.
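The two steps above can be sketched with `pd.get_dummies` on a hypothetical `color` column. The frame `colors` and the `is` prefix are chosen only to mirror the example. The `dtype=int` argument is used so the columns show 0/1 instead of booleans.

```python
import pandas as pd

# Hypothetical nominal feature with no inherent order
colors = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One binary column per category; the original "color" column is replaced
one_hot = pd.get_dummies(colors, columns=["color"], prefix="is", dtype=int)

print(one_hot)
# Exactly one column is 1 in every row
print(one_hot.sum(axis=1).tolist())
```

The new columns come out in alphabetical order of the category names (`is_blue`, `is_green`, `is_red`).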

**Common algorithms include:**

One-hot encoding works well with most machine learning algorithms, including:

*   **Linear Models:** Linear Regression, Logistic Regression.
*   **Tree-Based Models:** Decision Trees, Random Forest, Gradient Boosting (XGBoost, LightGBM, etc.).
*   **Support Vector Machines (SVM):** Suitable for both classification and regression.
*   **K-Nearest Neighbors (KNN):** Handles binary features effectively.
*   **Naive Bayes:** Works well with binary inputs.
*   **Neural Networks:** Fully compatible with one-hot encoded features.

In [None]:
# Drop a column by name (Loan_ID has high cardinality, i.e. many distinct values, so it should be left out of one-hot encoding)
df2 = dataset.drop(['Loan_ID'], axis=1)

In [None]:
df2.head()

Unnamed: 0,Gender,ApplicantIncome,CoapplicantIncome,LoanAmount,Area,Loan_Status
0,,5849.0,0.0,,urban,Y
1,Male,4583.0,,128.0,semi,N
2,Male,3000.0,0.0,66.0,,Y
3,Female,2583.0,2358.0,120.0,semi,
4,Male,,0.0,141.0,urban,Y


In [None]:
# using get_dummies function of Pandas
df2 = pd.get_dummies(df2)

In [None]:
df2.head()

Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Gender_Female,Gender_Male,Area_rural,Area_semi,Area_urban,Loan_Status_N,Loan_Status_Y
0,5849.0,0.0,,False,False,False,False,True,False,True
1,4583.0,,128.0,False,True,False,True,False,True,False
2,3000.0,0.0,66.0,False,True,False,False,False,False,True
3,2583.0,2358.0,120.0,True,False,False,True,False,False,False
4,,0.0,141.0,False,True,False,False,True,False,True


# **Dummy Variable Trap**

The dummy variable trap occurs when you create one dummy variable for every category, which leads to perfect multicollinearity and causes problems in regression analysis. For instance, if a categorical variable has three categories (e.g., "Red," "Green," "Blue"), creating three dummy variables (one per category) means they always sum to 1. Any one of them can then be predicted exactly from the others, so a regression model with an intercept cannot estimate a unique coefficient for each.

**How to Minimize the Dummy Variable Trap**

To avoid the dummy variable trap, create k-1 dummy variables for a feature with k categories. The dropped dummy variable serves as the reference category.

**Here's an example:**

Suppose you have a categorical variable "Color" with three categories: "Red," "Green," and "Blue." You can create k-1 = 3-1 = 2 dummy variables to avoid multicollinearity. The dropped category needs no column of its own: a row with 0 in both remaining columns must belong to it.
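A quick numerical check of the Color example, as a sketch under stated assumptions: the Series `color` and the helper `design_rank` are hypothetical and introduced only for this illustration. The helper puts an intercept column next to the dummies and reports the matrix rank. With all three dummies the rank falls short of the column count, and with k-1 dummies it does not.

```python
import numpy as np
import pandas as pd

# Hypothetical feature with k = 3 categories
color = pd.Series(["Red", "Green", "Blue", "Green", "Red"], name="Color")

full = pd.get_dummies(color, dtype=float)                      # k = 3 columns
reduced = pd.get_dummies(color, drop_first=True, dtype=float)  # k - 1 = 2 columns

# Every row of the full encoding sums to 1, i.e. it duplicates an intercept column
row_sums = full.sum(axis=1)

def design_rank(dummies):
    """Rank of the design matrix [intercept | dummies] (illustrative helper)."""
    intercept = np.ones((len(dummies), 1))
    return int(np.linalg.matrix_rank(np.hstack([intercept, dummies.to_numpy()])))

print(row_sums.tolist())
print(design_rank(full), "independent columns out of", full.shape[1] + 1)
print(design_rank(reduced), "independent columns out of", reduced.shape[1] + 1)
```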

In [None]:
# Avoid the dummy variable trap using drop_first

# Step 1: Create a copy of the dataset without Loan_ID
df3 = dataset.drop(['Loan_ID'], axis=1)

# Step 2: Create k-1 dummy columns per feature by passing drop_first=True
df3 = pd.get_dummies(df3, drop_first=True)

In [None]:
df3.head()

# Compare with df2: 'Gender_Female', 'Area_rural' and 'Loan_Status_N' have been dropped to avoid multicollinearity.

Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Gender_Male,Area_semi,Area_urban,Loan_Status_Y
0,5849.0,0.0,,False,False,True,True
1,4583.0,,128.0,True,True,False,False
2,3000.0,0.0,66.0,True,False,False,True
3,2583.0,2358.0,120.0,False,True,False,False
4,,0.0,141.0,True,False,True,True


# **Section 4: Data Normalization**

**Data normalization** is a preprocessing technique used to scale numerical data to a standard range, typically between 0 and 1, or to a standard distribution. This process helps improve the performance and training stability of machine learning models by ensuring that all features contribute equally to the model's learning process.

**This is only performed on numerical columns (not on category columns)**

Categorical columns are usually handled differently as we learned earlier, such as through label encoding or one-hot encoding, to convert them into a numerical format before using them in machine learning models.

In [None]:
# extract data to scale
data_to_scale = dataset_clean.iloc[:, 2:5]

In [None]:
data_to_scale.head()

Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount
5,5417.0,4196.0,267.0
7,3036.0,2504.0,158.0
8,4006.0,1526.0,168.0
9,12841.0,10968.0,349.0
10,3200.0,700.0,70.0


# **Type 1: Standard Scaler**

Standard Scaler is a feature scaling technique used to standardize numerical data by removing the mean and scaling it to unit variance. This is crucial when features in the dataset have different scales, as it ensures that each feature contributes equally to the model's learning process.
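Written out by hand, standardization is z = (x - mean) / std, computed per column. The sketch below applies that formula with plain pandas on a hypothetical frame `toy`. It uses the population standard deviation (`ddof=0`), which is also what scikit-learn's `StandardScaler` uses.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric features on very different scales
toy = pd.DataFrame({"income": [2000.0, 4000.0, 6000.0], "loan": [100.0, 150.0, 200.0]})

# z = (x - mean) / std, column by column, with population std (ddof=0)
standardized = (toy - toy.mean()) / toy.std(ddof=0)

print(standardized)
# After scaling, each column has mean 0 and unit variance (up to rounding)
print(np.allclose(standardized.mean(), 0.0))
print(np.allclose(standardized.std(ddof=0), 1.0))
```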

In [None]:
!pip install scikit-learn

If you encounter an error like **"ModuleNotFoundError: No module named 'sklearn'"**, run the cell above to install the scikit-learn library.

In [None]:
# Import the StandardScaler class
from sklearn.preprocessing import StandardScaler

In [None]:
# Create an object of the class StandardScaler
scaler = StandardScaler()

In [None]:
# Fit and Transform the data for normalization
ss_scaler = scaler.fit_transform(data_to_scale)

In [None]:
ss_scaler

array([[ 0.32879835,  0.43390231,  1.1995349 ],
       [-0.40126276, -0.11198867,  0.05261118],
       [-0.10384182, -0.4275214 ,  0.15783354],
       [ 2.60514177,  2.61875676,  2.06235824],
       [-0.35097716, -0.69401428, -0.87334558],
       [-0.56561083, -0.32621539, -0.46297838],
       [-0.76399367, -0.00358478, -0.4103672 ],
       [-0.9338609 , -0.56947886, -1.43102409],
       [ 0.18560703, -0.9198557 , -0.29462261]])

In [None]:
# Convert numpy array to pandas DataFrame
df_ss = pd.DataFrame(ss_scaler)

# Use the head() method
print(df_ss.head())     # Column names are now 0, 1 and 2 instead of 'ApplicantIncome', 'CoapplicantIncome' and 'LoanAmount'

          0         1         2
0  0.328798  0.433902  1.199535
1 -0.401263 -0.111989  0.052611
2 -0.103842 -0.427521  0.157834
3  2.605142  2.618757  2.062358
4 -0.350977 -0.694014 -0.873346


In [None]:
# To retain the original column names after converting the numpy array back to a pandas DataFrame,
# specify the column names when creating the DataFrame:

# Original column names
column_names = ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount']

# Convert numpy array to pandas DataFrame with specified column names
df_ss = pd.DataFrame(ss_scaler, columns=column_names)

# Use the head() method to view the first few rows
print(df_ss.head())

   ApplicantIncome  CoapplicantIncome  LoanAmount
0         0.328798           0.433902    1.199535
1        -0.401263          -0.111989    0.052611
2        -0.103842          -0.427521    0.157834
3         2.605142           2.618757    2.062358
4        -0.350977          -0.694014   -0.873346


# **Type 2: Min-Max Scaler**

Min-Max Scaler is a feature scaling technique that transforms the data into a specified range, usually between 0 and 1. It scales each feature independently by subtracting the minimum value and dividing by the range (max - min) of the feature.
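The formula in words is x_scaled = (x - min) / (max - min), applied to each column separately. A minimal sketch with plain pandas on a hypothetical frame `toy`:

```python
import pandas as pd

# Hypothetical numeric features
toy = pd.DataFrame({"income": [2000.0, 3000.0, 6000.0], "loan": [100.0, 150.0, 200.0]})

# Subtract each column's minimum, then divide by that column's range
scaled = (toy - toy.min()) / (toy.max() - toy.min())

print(scaled)
```

Each column's smallest value maps to 0 and its largest to 1, which matches the rows of 0s and 1s visible in the scaled output of the loan data.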

In [None]:
# MinMax Normalization of the data
from sklearn.preprocessing import minmax_scale

In [None]:
# Fit and Transform the data for MinMax normalization
mm_scaler = minmax_scale(data_to_scale)

In [None]:
mm_scaler #numpy array

array([[0.35678392, 0.38256747, 0.75301205],
       [0.15049385, 0.22830051, 0.4246988 ],
       [0.23453474, 0.13913202, 0.45481928],
       [1.        , 1.        , 1.        ],
       [0.16470282, 0.06382203, 0.15963855],
       [0.10405476, 0.16776076, 0.27710843],
       [0.04799861, 0.25893508, 0.29216867],
       [0.        , 0.09901532, 0.        ],
       [0.31632299, 0.        , 0.3253012 ]])

In [None]:
# To retain the original column names after converting the numpy array back to a pandas DataFrame,
# specify the column names when creating the DataFrame:

# Original column names
column_names = ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount']

# Convert numpy array to pandas DataFrame with specified column names
df_mm = pd.DataFrame(mm_scaler, columns=column_names)

# Use the head() method to view the first few rows
print(df_mm.head())

   ApplicantIncome  CoapplicantIncome  LoanAmount
0         0.356784           0.382567    0.753012
1         0.150494           0.228301    0.424699
2         0.234535           0.139132    0.454819
3         1.000000           1.000000    1.000000
4         0.164703           0.063822    0.159639


# **Section 5: Data Splitting**

**Data splitting** is a technique used in machine learning to divide a dataset into separate parts for training and evaluating a model. This helps ensure that the model generalizes well to new, unseen data. The most common splits are:

**Training Set:** Used to train the model. Typically, this is the largest portion of the data.
                                                                                                  
**Validation Set:** Used to tune the model's hyperparameters and prevent overfitting. This set is optional but useful for model selection.
                                                                                                  
**Test Set:** Used to evaluate the final model's performance. This set should only be used once the model is fully trained and tuned.



------------------------------------------------------------------------------------------------------------



**Common Splitting Ratios**

**Training/Testing Split:** A common ratio is 80% training and 20% testing.

**Training/Validation/Test Split:** A common ratio is 70% training, 15% validation, and 15% testing.
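A 70/15/15 split can be sketched with plain pandas by shuffling the rows once and slicing. This is an illustration on a hypothetical 20-row frame `toy`, using `DataFrame.sample` rather than scikit-learn's `train_test_split`. The seed 1234 is arbitrary.

```python
import pandas as pd

# Hypothetical dataset with 20 rows
toy = pd.DataFrame({"feature": range(20), "target": [i % 2 for i in range(20)]})

# Shuffle all rows reproducibly
shuffled = toy.sample(frac=1.0, random_state=1234)

n_train = round(0.70 * len(shuffled))   # 14 rows
n_val = round(0.15 * len(shuffled))     # 3 rows; the remainder goes to the test set

train = shuffled.iloc[:n_train]
val = shuffled.iloc[n_train:n_train + n_val]
test = shuffled.iloc[n_train + n_val:]

print(len(train), len(val), len(test))
```

Because the three slices come from one shuffled frame, no row can land in more than one set.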

In [None]:
df = dataset.copy()

In [None]:
df.head()

Unnamed: 0,Loan_ID,Gender,ApplicantIncome,CoapplicantIncome,LoanAmount,Area,Loan_Status
0,LP001002,,5849.0,0.0,,urban,Y
1,LP001003,Male,4583.0,,128.0,semi,N
2,LP001005,Male,3000.0,0.0,66.0,,Y
3,LP001006,Female,2583.0,2358.0,120.0,semi,
4,LP001008,Male,,0.0,141.0,urban,Y


In [None]:
# Split by column into X (independent) and Y (dependent) variables
X = df.iloc[:, :-1]
Y = df.iloc[:, [-1]]

# Why [-1] is wrapped in a list for Y:
# Passing a list of positions (e.g., df.iloc[:, [-1]]) makes pandas return a DataFrame;
# passing a bare integer (df.iloc[:, -1]) returns a Series instead.
# Use the list form when you select a single column but want a table-like result.
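The difference between the two forms is easy to verify on a tiny frame. The frame `toy` below is hypothetical and has just two rows.

```python
import pandas as pd

# Hypothetical two-row frame
toy = pd.DataFrame({"LoanAmount": [128.0, 66.0], "Loan_Status": ["N", "Y"]})

as_series = toy.iloc[:, -1]     # bare integer position -> Series
as_frame = toy.iloc[:, [-1]]    # list of positions -> DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

The Series has a one-dimensional shape, while the DataFrame keeps both dimensions and its column name as a header.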

In [None]:
X.head()

Unnamed: 0,Loan_ID,Gender,ApplicantIncome,CoapplicantIncome,LoanAmount,Area
0,LP001002,,5849.0,0.0,,urban
1,LP001003,Male,4583.0,,128.0,semi
2,LP001005,Male,3000.0,0.0,66.0,
3,LP001006,Female,2583.0,2358.0,120.0,semi
4,LP001008,Male,,0.0,141.0,urban


In [None]:
Y.head()

Unnamed: 0,Loan_Status
0,Y
1,N
2,Y
3,
4,Y


In [None]:
df.head()

Unnamed: 0,Loan_ID,Gender,ApplicantIncome,CoapplicantIncome,LoanAmount,Area,Loan_Status
0,LP001002,,5849.0,0.0,,urban,Y
1,LP001003,Male,4583.0,,128.0,semi,N
2,LP001005,Male,3000.0,0.0,66.0,,Y
3,LP001006,Female,2583.0,2358.0,120.0,semi,
4,LP001008,Male,,0.0,141.0,urban,Y


In [None]:
# Split by rows into training and test datasets
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=1234
)

*In the results below, the index values (rows) match between X_train and Y_train (the 70% split), and similarly between X_test and Y_test (the 30% split).*

**Total rows in dataset = 15**

**30% of 15** = 0.3*15 = 4.5, which scikit-learn rounds up to 5 (in X_test and Y_test)

**Remaining rows** = 15 - 5 = 10 (in X_train and Y_train)
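The split sizes can be checked with a little arithmetic. When only a fractional `test_size` is given, scikit-learn rounds the test share up and assigns the remaining rows to training. That matches the 10 training rows and 5 test rows in the outputs that follow. The variable names below are illustrative.

```python
import math

n_samples = 15
test_size = 0.3

# The test share is rounded up; training gets whatever is left
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 10 5
```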

In [None]:
X_train

Unnamed: 0,Loan_ID,Gender,ApplicantIncome,CoapplicantIncome,LoanAmount,Area
2,LP001005,Male,3000.0,0.0,66.0,
10,LP001024,Female,3200.0,700.0,70.0,urban
7,LP001014,Female,3036.0,2504.0,158.0,semi
1,LP001003,Male,4583.0,,128.0,semi
9,LP001020,Male,12841.0,10968.0,349.0,semi
8,LP001018,Male,4006.0,1526.0,168.0,rural
4,LP001008,Male,,0.0,141.0,urban
5,LP001011,Male,5417.0,4196.0,267.0,semi
6,LP001013,Male,2333.0,1516.0,,rural
3,LP001006,Female,2583.0,2358.0,120.0,semi


In [None]:
Y_train

Unnamed: 0,Loan_Status
2,Y
10,Y
7,N
1,N
9,N
8,Y
4,Y
5,Y
6,Y
3,


In [None]:
X_test

Unnamed: 0,Loan_ID,Gender,ApplicantIncome,CoapplicantIncome,LoanAmount,Area
13,LP001029,Male,1853.0,2840.0,114.0,urban
11,LP001027,Male,2500.0,1840.0,109.0,urban
0,LP001002,,5849.0,0.0,,urban
12,LP001028,Female,,8106.0,,urban
14,LP001030,Male,1299.0,1086.0,17.0,semi


In [None]:
Y_test

Unnamed: 0,Loan_Status
13,N
11,Y
0,Y
12,Y
14,Y
