# Name    : Abhishek Subhash Swami
# Roll No.:

# **Experiment No. 3**
# *Data Preprocessing, reading the dataset, handling missing data, conversion to tensor format*

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Data Preprocessing
> Data preprocessing prepares raw data for analysis or machine learning. It covers reading the dataset, handling missing data, and converting the data to a format suited to efficient computation, usually tensors. The sections below walk through each task in turn.

### Reading the dataset
> Reading the dataset is the first step. The choice of library depends on the file format (CSV, Excel, etc.); pandas is the usual choice for structured, tabular data.

In [23]:
import pandas as pd
import numpy as np

data=pd.read_csv('/content/drive/MyDrive/Data/ParisHousing.csv')
data.head(5)

Unnamed: 0,squareMeters,numberOfRooms,hasYard,hasPool,floors,cityCode,cityPartRange,numPrevOwners,made,isNewBuilt,hasStormProtector,basement,attic,garage,hasStorageRoom,hasGuestRoom,price
0,75523,3.0,0,1,63.0,9373.0,3.0,8,2005.0,0,1.0,4313.0,9005.0,956,0.0,7.0,7559081.5
1,80771,39.0,1,1,98.0,39381.0,8.0,6,2015.0,1,0.0,3653.0,2436.0,128,1.0,2.0,8085989.5
2,55712,58.0,0,1,19.0,34457.0,6.0,8,2021.0,0,0.0,2937.0,8852.0,135,1.0,9.0,5574642.1
3,32316,47.0,0,0,6.0,27939.0,10.0,4,2012.0,0,1.0,659.0,7141.0,359,0.0,3.0,3232561.2
4,70429,19.0,1,1,90.0,38045.0,3.0,7,1990.0,1,0.0,8435.0,2429.0,292,1.0,4.0,7055052.0
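
> The `Unnamed: 0` column in the output above typically appears when a CSV was saved together with its row index. The sketch below reproduces this on a small in-memory CSV (hypothetical sample rows, not the real file) and shows one possible fix, assuming the first column really is just a saved index.

```python
import io

import pandas as pd

# Hypothetical miniature CSV standing in for a file saved with its index.
csv_text = """,squareMeters,numberOfRooms,price
0,75523,3.0,7559081.5
1,80771,39.0,8085989.5
2,55712,,5574642.1
"""

# Default read: the unlabeled first column is named "Unnamed: 0".
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.columns.tolist())

# index_col=0 uses that first column as the row index instead.
housing = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(housing.columns.tolist())
print(housing.shape)
```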


### Handling Null values / missing data
> Missing (null) values can degrade the quality of an analysis or a machine learning model, so they must be handled before modelling. pandas provides several ways to detect, remove, and fill them; the cells below cover the common techniques.

In [16]:
# Detecting null values

print(data.isnull().sum()) #gives total null values per column

squareMeters         0
numberOfRooms        1
hasYard              0
hasPool              0
floors               2
cityCode             8
cityPartRange        2
numPrevOwners        0
made                 2
isNewBuilt           0
hasStormProtector    1
basement             3
attic                1
garage               0
hasStorageRoom       1
hasGuestRoom         2
price                4
dtype: int64
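
> Raw counts are easier to judge next to the share of rows affected. The following is a minimal sketch on a toy frame (the column names and values are illustrative assumptions, not the full dataset) that reports counts, percentages, and the rows containing any null.

```python
import numpy as np
import pandas as pd

# Toy frame with a few missing entries (illustrative values only).
frame = pd.DataFrame({
    "floors": [63.0, np.nan, 19.0, 6.0, 90.0],
    "price": [7559081.5, 8085989.5, np.nan, np.nan, 7055052.0],
})

# Number of nulls per column.
null_counts = frame.isnull().sum()

# Percentage of rows that are null in each column.
null_share = frame.isnull().mean() * 100

# Rows that contain at least one null value.
rows_with_nulls = frame[frame.isnull().any(axis=1)]

print(null_counts)
print(null_share)
print(rows_with_nulls)
```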

In [17]:
# Removing null values

df = data.dropna()  # returns a copy without the rows that contain any null value
print(df.isnull().sum())

squareMeters         0
numberOfRooms        0
hasYard              0
hasPool              0
floors               0
cityCode             0
cityPartRange        0
numPrevOwners        0
made                 0
isNewBuilt           0
hasStormProtector    0
basement             0
attic                0
garage               0
hasStorageRoom       0
hasGuestRoom         0
price                0
dtype: int64

In [18]:
df = data.dropna(axis=1)  # returns a copy without the columns that contain any null value
print(df.isnull().sum())

squareMeters     0
hasYard          0
hasPool          0
numPrevOwners    0
isNewBuilt       0
garage           0
dtype: int64
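
> Dropping every row or column with a null can discard a lot of usable data. As a hedged sketch, `dropna` also accepts `subset` and `thresh` to narrow what gets removed; the toy frame and its column names below are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative values only).
frame = pd.DataFrame({
    "rooms": [3.0, np.nan, 58.0, 47.0],
    "floors": [63.0, np.nan, 19.0, np.nan],
    "price": [7559081.5, 8085989.5, np.nan, 3232561.2],
})

# Drop a row only when the target column is missing.
with_target = frame.dropna(subset=["price"])

# Keep rows that have at least two non-null values.
mostly_complete = frame.dropna(thresh=2)

# Default behaviour for comparison: drop rows with any null.
complete_only = frame.dropna()

print(with_target.index.tolist())
print(mostly_complete.index.tolist())
print(complete_only.index.tolist())
```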

In [31]:
# Creating a DataFrame with null values
data = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [6, np.nan, 8, np.nan, 10]
})
print(data)

     A     B
0  1.0   6.0
1  2.0   NaN
2  NaN   8.0
3  4.0   NaN
4  5.0  10.0


In [30]:
# Imputing Null values with mean
imputed_data=data.fillna(data.mean())
print(imputed_data)


     A     B
0  1.0   6.0
1  2.0   8.0
2  3.0   8.0
3  4.0   8.0
4  5.0  10.0


In [29]:
# Imputing Null values with specific value
imputed_data=data.fillna(0)
print(imputed_data)

     A     B
0  1.0   6.0
1  2.0   0.0
2  0.0   8.0
3  4.0   0.0
4  5.0  10.0
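
> The mean is sensitive to outliers, so the median is sometimes preferred as the fill value. The sketch below compares the two and also shows a per-column fill using a dictionary; the outlier value 100 is an assumption added to make the difference visible.

```python
import numpy as np
import pandas as pd

# Same layout as the example frame, but with an outlier in column A.
frame = pd.DataFrame({
    "A": [1, 2, np.nan, 4, 100],
    "B": [6, np.nan, 8, np.nan, 10],
})

# Fill each column with its own mean or median.
mean_filled = frame.fillna(frame.mean())
median_filled = frame.fillna(frame.median())

# Different strategy per column: median for A, constant 0 for B.
per_column = frame.fillna({"A": frame["A"].median(), "B": 0})

print(mean_filled)
print(median_filled)
print(per_column)
```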


In [32]:

# Forward fill: propagate the last valid value downward
# (fillna(method='ffill') is deprecated in recent pandas versions)
data_ffill = data.ffill()
print(data_ffill)

     A     B
0  1.0   6.0
1  2.0   6.0
2  2.0   8.0
3  4.0   8.0
4  5.0  10.0


In [33]:

# Backward fill: propagate the next valid value upward
# (fillna(method='bfill') is deprecated in recent pandas versions)
data_bfill = data.bfill()
print(data_bfill)

     A     B
0  1.0   6.0
1  2.0   8.0
2  4.0   8.0
3  4.0  10.0
4  5.0  10.0
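
> Forward and backward fill have edge cases worth knowing: a forward fill cannot fill a leading null, and a long gap is filled entirely unless `limit` is set. A small sketch on an assumed toy series:

```python
import numpy as np
import pandas as pd

# Toy series with a leading null and a two-element gap.
series = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0])

# Forward fill leaves the leading null untouched.
forward = series.ffill()

# limit=1 fills at most one consecutive null per gap.
limited = series.ffill(limit=1)

# Chaining a backward fill covers the remaining leading null.
both = series.ffill().bfill()

print(forward.tolist())
print(limited.tolist())
print(both.tolist())
```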


In [35]:
# Interpolation

data_interpolated = data.interpolate(method='linear')
print(data_interpolated)

     A     B
0  1.0   6.0
1  2.0   7.0
2  3.0   8.0
3  4.0   9.0
4  5.0  10.0
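
> With `method='linear'`, pandas treats the rows as equally spaced and places each missing value on the straight line between its nearest known neighbours. The sketch below checks this against `numpy.interp` on an assumed toy series with a two-element gap.

```python
import numpy as np
import pandas as pd

# Toy series with two consecutive missing values.
values = pd.Series([1.0, np.nan, np.nan, 4.0])

# pandas linear interpolation (rows treated as equally spaced).
interpolated = values.interpolate(method="linear")

# Manual equivalent: interpolate over row positions with NumPy.
positions = np.arange(len(values))
known = values.notna().to_numpy()
manual = np.interp(positions, positions[known], values.to_numpy()[known])

print(interpolated.tolist())
print(manual.tolist())
```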


### Conversion to Tensor Format
> Tensors are multi-dimensional arrays and the basic data structure for numerical computation in machine learning and scientific computing. A cleaned DataFrame is usually converted to a tensor before it is passed to a model. The subsections below show the conversion in three common libraries.

#### NumPy Tensors
> In NumPy, tensors are represented as arrays (`ndarray`). These arrays can have any number of dimensions and support a wide range of mathematical and numerical operations.

In [38]:
numpy_tensor=data_interpolated.to_numpy()
print(numpy_tensor)

[[ 1.  6.]
 [ 2.  7.]
 [ 3.  8.]
 [ 4.  9.]
 [ 5. 10.]]
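
> In practice the array is often split into features and a target and cast to `float32`, the default floating-point type in most deep learning frameworks. A minimal sketch, assuming column `B` plays the role of the target (an assumption for illustration only):

```python
import numpy as np
import pandas as pd

# Frame matching the interpolated example above.
frame = pd.DataFrame({
    "A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "B": [6.0, 7.0, 8.0, 9.0, 10.0],
})

# Double brackets keep the features two-dimensional: (rows, columns).
features = frame[["A"]].to_numpy(dtype=np.float32)

# A single column gives a one-dimensional target vector.
target = frame["B"].to_numpy(dtype=np.float32)

# Sanity check before training: no nulls should remain.
assert not np.isnan(features).any()

print(features.shape, features.dtype)
print(target.shape, target.dtype)
```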


#### PyTorch Tensors
> In PyTorch, tensors are instances of `torch.Tensor`. Like NumPy arrays they can have any number of dimensions, and they additionally support GPU execution and automatic differentiation.

In [40]:
import torch

# Convert DataFrame to PyTorch tensor
torch_tensor = torch.tensor(data_interpolated.values)
print(torch_tensor)

tensor([[ 1.,  6.],
        [ 2.,  7.],
        [ 3.,  8.],
        [ 4.,  9.],
        [ 5., 10.]], dtype=torch.float64)
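
> The tensor above is `float64` because pandas stores these columns as 64-bit floats, while PyTorch layers default to `float32`. The sketch below shows two conversion routes on an assumed small array: `torch.from_numpy` shares memory with the source array, whereas `torch.tensor` makes a copy and can cast at the same time.

```python
import numpy as np
import torch

# Small float64 array standing in for DataFrame values.
array = np.array([[1.0, 6.0], [2.0, 7.0]])

# Shares memory with the NumPy array and keeps its dtype (float64).
shared = torch.from_numpy(array)

# Copies the data and casts to float32 in one step.
copied = torch.tensor(array, dtype=torch.float32)

# Modifying the source array is visible through the shared tensor only.
array[0, 0] = 99.0

print(shared.dtype, shared[0, 0].item())
print(copied.dtype, copied[0, 0].item())
```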

#### TensorFlow Tensors
> The TensorFlow library also uses tensors (`tf.Tensor`) as its primary data structure for building and training machine learning models.

In [41]:
import tensorflow as tf

# Convert DataFrame to TensorFlow tensor
tf_tensor = tf.constant(data_interpolated.values)
print(tf_tensor)

tf.Tensor(
[[ 1.  6.]
 [ 2.  7.]
 [ 3.  8.]
 [ 4.  9.]
 [ 5. 10.]], shape=(5, 2), dtype=float64)
