
**Importing Libraries**

In [16]:
# Importing the pandas and numpy libraries
import pandas as pd
import numpy as np

**Loading Dataset**

In [17]:
# read_csv is a pandas function that loads a CSV file into a DataFrame
# CSV stands for comma-separated values
# In a CSV file, each row of the table below is one line, and the values in a row are separated by commas
df = pd.read_csv('/content/sample_data/california_housing_train.csv')
# In IPython, the value of the last expression in a cell is displayed automatically, so print is not needed here.
df

,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.40,19.0,7650.0,1901.0,1129.0,463.0,1.8200,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.9250,65500.0
...,...,...,...,...,...,...,...,...,...
16995,-124.26,40.58,52.0,2217.0,394.0,907.0,369.0,2.3571,111400.0
16996,-124.27,40.69,36.0,2349.0,528.0,1194.0,465.0,2.5179,79000.0
16997,-124.30,41.84,17.0,2677.0,531.0,1244.0,456.0,3.0313,103600.0
16998,-124.30,41.80,19.0,2672.0,552.0,1298.0,478.0,1.9797,85800.0
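
To make the CSV description above concrete, here is a minimal sketch that parses a tiny CSV held in a string instead of a file. The three-column text and the name `small_df` are made up for illustration; only the column names are borrowed from the dataset.

```python
import io
import pandas as pd

# Hypothetical CSV text: one header line followed by two data lines.
csv_text = (
    "longitude,latitude,median_house_value\n"
    "-114.31,34.19,66900.0\n"
    "-114.47,34.40,80100.0\n"
)

# read_csv accepts a path or a file-like object, so StringIO works here.
small_df = pd.read_csv(io.StringIO(csv_text))

print(small_df.shape)          # (number of rows, number of columns)
print(list(small_df.columns))  # column names taken from the header line
```

Each line after the header becomes one row, and each comma-separated value lands in the column named by the header.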


**Dealing with missing data**

In [18]:
df.info()
# info() prints a summary of the dataframe
# The RangeIndex line below shows the number of rows, which is 17000
# It also shows the number of non-null (non-missing) values in each column
# In the output below, every column has 17000 non-null values, which means there are no missing values.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17000 entries, 0 to 16999
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           17000 non-null  float64
 1   latitude            17000 non-null  float64
 2   housing_median_age  17000 non-null  float64
 3   total_rooms         17000 non-null  float64
 4   total_bedrooms      17000 non-null  float64
 5   population          17000 non-null  float64
 6   households          17000 non-null  float64
 7   median_income       17000 non-null  float64
 8   median_house_value  17000 non-null  float64
dtypes: float64(9)
memory usage: 1.2 MB
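
Besides reading the non-null counts from `info()`, the missing values can be counted directly. The sketch below uses a made-up four-row frame (`toy` and its values are assumptions, not part of the dataset) and the `isna()` and `sum()` methods.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two missing entries in the "age" column.
toy = pd.DataFrame({
    "age": [15.0, np.nan, 17.0, np.nan],
    "rooms": [5612.0, 7650.0, 720.0, 1501.0],
})

# isna() marks each missing cell as True; sum() then counts the True values per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

A column with a count of 0 here corresponds to a column whose non-null count equals the number of rows in `info()`.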


In [19]:
# Real-world datasets almost always contain missing values, so we have to know how to deal with them.
# To learn that, we first need some missing values.
# So, let's blank out some values in our dataset on purpose so that we can then deal with them.
# Chained indexing such as df['housing_median_age'][index] = np.nan can raise SettingWithCopyWarning
# and may not modify df at all, so we assign through .loc with a boolean mask instead.
df.loc[df.index % 5 == 0, 'housing_median_age'] = np.nan

# Here we set housing_median_age to np.nan (the marker for a missing value) in every row whose index is a multiple of 5
df.head(10)

,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0
5,-114.58,33.63,,1387.0,236.0,671.0,239.0,3.3438,74000.0
6,-114.58,33.61,25.0,2907.0,680.0,1841.0,633.0,2.6768,82400.0
7,-114.59,34.83,41.0,812.0,168.0,375.0,158.0,1.7083,48500.0
8,-114.59,33.61,34.0,4789.0,1175.0,3134.0,1056.0,2.1782,58400.0
9,-114.6,34.83,46.0,1497.0,309.0,787.0,271.0,2.1908,48100.0


In [20]:
df.info()
# Now housing_median_age has only 13600 non-null values out of 17000 total entries,
# which means that 3400 of its values are null/missing

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17000 entries, 0 to 16999
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           17000 non-null  float64
 1   latitude            17000 non-null  float64
 2   housing_median_age  13600 non-null  float64
 3   total_rooms         17000 non-null  float64
 4   total_bedrooms      17000 non-null  float64
 5   population          17000 non-null  float64
 6   households          17000 non-null  float64
 7   median_income       17000 non-null  float64
 8   median_house_value  17000 non-null  float64
dtypes: float64(9)
memory usage: 1.2 MB
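
The 13600 and 3400 figures can be checked without the real file. This is a small sketch on a synthetic single-column frame of 17000 rows (`demo` and its constant values are assumptions); it blanks every fifth row with a boolean mask and counts the result.

```python
import numpy as np
import pandas as pd

n_rows = 17000

# Synthetic stand-in for the real column: 17000 rows, no missing values yet.
demo = pd.DataFrame({"housing_median_age": np.ones(n_rows)})

# Blank out the rows whose index is a multiple of 5 (0, 5, 10, ...).
demo.loc[demo.index % 5 == 0, "housing_median_age"] = np.nan

n_missing = int(demo["housing_median_age"].isna().sum())
n_present = int(demo["housing_median_age"].notna().sum())
print(n_missing, n_present)
```

One row in five is blanked, so 17000 / 5 = 3400 values are missing and the remaining 13600 are non-null.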


In [21]:
# When we have missing values like these, we have several options
# 1. Remove the columns that contain missing values. This is usually a bad idea: in real-world data many columns contain missing values, so we may end up with an almost empty dataset.
# Some columns are also so crucial for our purpose that we cannot afford to remove them.
# 2. Remove the rows that contain missing values. This is better than the previous option, but it can remove many rows from our dataset, which can hurt the performance of our model.
# The more data we have, the better
# 3. Fill the missing values with a sensible value, such as the mean, median or mode of the non-missing values in that column.

# In our case, the missing values are in housing_median_age. Let's fill them with the mean of the non-missing values in that column (mean() skips NaN by default).
df['housing_median_age'] = df['housing_median_age'].fillna(df['housing_median_age'].mean())

In [22]:
df.info()
# If we check now, the missing values will be gone

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17000 entries, 0 to 16999
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           17000 non-null  float64
 1   latitude            17000 non-null  float64
 2   housing_median_age  17000 non-null  float64
 3   total_rooms         17000 non-null  float64
 4   total_bedrooms      17000 non-null  float64
 5   population          17000 non-null  float64
 6   households          17000 non-null  float64
 7   median_income       17000 non-null  float64
 8   median_house_value  17000 non-null  float64
dtypes: float64(9)
memory usage: 1.2 MB
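
The three options for missing values discussed earlier (dropping columns, dropping rows, filling) can be compared side by side. This is a minimal sketch on a made-up four-row frame; `toy` and all variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: "age" has two missing values, "rooms" has none.
toy = pd.DataFrame({
    "age": [10.0, np.nan, 30.0, np.nan],
    "rooms": [100.0, 200.0, 300.0, 400.0],
})

# Option 1: drop every column that contains at least one missing value.
dropped_cols = toy.dropna(axis=1)

# Option 2: drop every row that contains at least one missing value.
dropped_rows = toy.dropna(axis=0)

# Option 3: fill the gaps; mean() and median() ignore NaN by default.
filled_mean = toy.fillna({"age": toy["age"].mean()})
filled_median = toy.fillna({"age": toy["age"].median()})

print(dropped_cols.shape, dropped_rows.shape, filled_mean.shape)
```

Only the fill option keeps the full shape of the frame, which is why the notebook uses it here. Whether the mean, the median or something else is the better fill value depends on the column.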


**Identifying dependent and independent variables**

In [23]:
# An independent variable (feature) is a column whose value we use as an input; we do not try to predict it from the other columns
# A dependent variable (target) is a column whose value depends on, and is predicted from, the other columns
# In our dataset, we treat median_house_value as dependent on all the other columns, and the other columns as independent inputs.
# Using common sense, we can say that the price of a house depends on its location, its size, the income level of the area, etc.
# We separate the independent (X) and dependent (y) variables from our dataframe to make the later steps easier.
X = df.drop('median_house_value', axis=1)
y = df['median_house_value']
# drop returns a new dataframe without the labels we asked it to drop; df itself is not changed.
# By default drop uses axis=0, which means it looks for the label in the row index and removes rows.
# median_house_value is a column name, so we pass axis=1 to remove a column instead.
# In our dataframe each row is a data item and each column is a feature.
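
The effect of the `axis` argument is easier to see on a small frame. The sketch below uses a made-up two-column frame (`toy` and its values are assumptions) and drops first a column, then a row.

```python
import pandas as pd

# Hypothetical toy frame with two columns and the default row labels 0, 1, 2.
toy = pd.DataFrame({"rooms": [5.0, 7.0, 3.0], "price": [100.0, 150.0, 80.0]})

# axis=1: the label is a column name, so the "price" column is removed.
without_price = toy.drop("price", axis=1)

# The columns= keyword does the same thing and is often easier to read.
same_result = toy.drop(columns="price")

# axis=0 (the default): the label is a row index, so the row labelled 0 is removed.
without_row0 = toy.drop(0, axis=0)

print(list(without_price.columns), list(without_row0.index))
```

In every case `drop` returns a new frame, so `toy` still has both columns and all three rows afterwards.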

In [24]:
X.head()
# head() returns the first 5 rows of a dataframe by default
# The original df dataframe has all the columns
# Since X holds only the independent variables, it has every column except median_house_value

,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income
0,-114.31,34.19,28.528897,5612.0,1283.0,1015.0,472.0,1.4936
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925


In [25]:
y.tail()
# tail() returns the last 5 rows by default
# y is a pandas Series that contains only median_house_value after the separation
# head and tail work on both DataFrames and Series; we use them just to get an idea of what our data looks like

16995    111400.0
16996     79000.0
16997    103600.0
16998     85800.0
16999     94600.0
Name: median_house_value, dtype: float64
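
Both `head` and `tail` also accept the number of rows to return. Here is a short sketch on a made-up ten-element Series (`s` is an assumption, not part of the dataset).

```python
import pandas as pd

# Hypothetical Series holding the integers 0 through 9.
s = pd.Series(range(10), name="value")

first_three = s.head(3)   # first 3 entries
last_two = s.tail(2)      # last 2 entries
default_head = s.head()   # no argument: first 5 entries

print(first_three.tolist(), last_two.tolist(), len(default_head))
```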