# Data preparation and engineering

Data preparation and engineering is the process of getting data ready for analysis or modeling. It typically involves cleaning and preprocessing the data, transforming it into a suitable format, and selecting the features that are relevant to the analysis or model. The main steps are:

1. Data acquisition: This step involves collecting data from various sources, such as databases, APIs, or flat files.

2. Data cleaning: This step involves identifying and correcting errors, missing values, and inconsistencies in the data.

3. Data preprocessing: This step involves transforming the data into a suitable format for analysis or modeling. This can include tasks such as scaling, encoding, and discretization.

4. Feature selection: This step involves selecting a subset of relevant features from the data for use in the analysis or model.

5. Data transformation: This step involves applying transformation techniques to the data, such as aggregation, summarization, or normalization, in order to prepare it for analysis or modeling.
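
The five steps above can be sketched end to end on a toy table. This is only an illustration under stated assumptions: the column names (`age`, `city`, `income`, `row_id`) and the values are hypothetical, and an in-memory DataFrame stands in for a real data source.

```python
import pandas as pd

# 1. Acquisition: a small in-memory table stands in for a database, API, or file.
raw = pd.DataFrame({
    "age": [25, 32, None, 41],
    "city": ["Paris", "Lima", "Lima", "Paris"],
    "income": [30000, 42000, 39000, 58000],
    "row_id": [101, 102, 103, 104],
})

# 2. Cleaning: fill the missing age with the median of the observed ages.
clean = raw.copy()
clean["age"] = clean["age"].fillna(clean["age"].median())

# 3. Preprocessing: one-hot encode the categorical column.
encoded = pd.get_dummies(clean, columns=["city"], dtype=int)

# 4. Feature selection: drop an identifier that carries no predictive signal.
features = encoded.drop(columns=["row_id"])

# 5. Transformation: rescale income to the [0, 1] range (min-max normalization).
income = features["income"]
features["income"] = (income - income.min()) / (income.max() - income.min())

print(features)
```

In practice the order and content of these steps depend on the dataset; the sketch only shows how each step maps to a line or two of Pandas code.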

Before turning to data cleaning and preprocessing, let us explore two commonly used Python libraries: NumPy and Pandas.

# 1.1 NumPy and Pandas

## 1.1.1 NumPy

<b>NumPy</b> is a Python library that is used for working with large, multi-dimensional arrays and matrices of numerical data. It provides a variety of functions and methods for performing operations on these arrays, including mathematical, statistical, and logical operations.

Here are some examples of common NumPy operations:

1. <b>Creating arrays:</b> You can create NumPy arrays using the np.array function, which takes a list or tuple as input and returns a NumPy array. For example:

In [11]:
import numpy as np

# Create a 1-dimensional array
a = np.array([1, 2, 3, 4])

# Create a 2-dimensional array
b = np.array([[1, 2, 3], [4, 5, 6]])

# Get the data type of a NumPy array
print(a.dtype)  # Output: int64 (may be int32 on some platforms)

# Get the number of dimensions of a NumPy array
print(a.ndim)  # Output: 1
print(b.ndim)  # Output: 2

# Get the size of a NumPy array (total number of elements)
print(a.size)  # Output: 4
print(b.size)  # Output: 6

int64
1
2
4
6


2. <b>Array shape and size:</b> You can use the shape and size attributes to find out the dimensions and size of a NumPy array. For example:

In [2]:
import numpy as np

# Create a 2-dimensional array
a = np.array([[1, 2, 3], [4, 5, 6]])

# Get the shape of the array
print(a.shape)  # Output: (2, 3)

# Get the size of the array
print(a.size)  # Output: 6

(2, 3)
6


3. <b>Array indexing and slicing:</b> You can access individual elements or slices of a NumPy array using indexing and slicing. For example:

In [3]:
import numpy as np

# Create a 1-dimensional array
a = np.array([1, 2, 3, 4])

# Access the second element
print(a[1])  # Output: 2

# Access the second and third elements
print(a[1:3])  # Output: [2 3]

# Create a 2-dimensional array
b = np.array([[1, 2, 3], [4, 5, 6]])

# Access the element at row 0, column 1
print(b[0, 1])  # Output: 2

# Access the second column
print(b[:, 1])  # Output: [2 5]

2
[2 3]
2
[2 5]


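You can also reshape, transpose, and flatten an array, or convert it to a Python list. The next cell reuses the 1-dimensional array a from the previous cell:

In [13]: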
# Reshape a NumPy array
c = a.reshape((2, 2))
print(c)  # Output: [[1 2]
            #          [3 4]]

# Transpose a NumPy array
d = c.T
print(d)  # Output: [[1 3]
            #          [2 4]]

# Flatten a NumPy array
e = d.flatten()
print(e)  # Output: [1 3 2 4]

# Convert a NumPy array to a list
f = e.tolist()
print(f)  # Output: [1, 3, 2, 4]

[[1 2]
 [3 4]]
[[1 3]
 [2 4]]
[1 3 2 4]
[1, 3, 2, 4]


4. <b>Array math:</b> NumPy provides a variety of functions and methods for performing mathematical and statistical operations on arrays. For example:

In [16]:
import numpy as np

# Create two 1-dimensional arrays
a = np.array([1, 2, 3, 4])
b = np.array([5, 6, 7, 8])

# Add the arrays element by element
c = a + b
print(c)  # Output: [ 6  8 10 12]

# Multiply the arrays element by element
d = a * b
print(d)  # Output: [ 5 12 21 32]

# Take the square root of each element
e = np.sqrt(a)
print(e)  # Output: [1.         1.41421356 1.73205081 2.        ]

# Compute the sum of all elements
print(np.sum(a))  # Output: 10

# Compute the mean of all elements
print(np.mean(a))  # Output: 2.5

# Compute the standard deviation of all elements
print(np.std(a))  # Output: 1.118033988749895

[ 6  8 10 12]
[ 5 12 21 32]
[1.         1.41421356 1.73205081 2.        ]
10
2.5
1.118033988749895
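
One detail worth making explicit: `np.std` computes the population standard deviation by default (it divides by N), whereas the sample standard deviation divides by N - 1. The short sketch below compares the two using the `ddof` parameter of `np.std`. Pandas uses `ddof=1` by default, so the same data can give slightly different results in the two libraries.

```python
import numpy as np

a = np.array([1, 2, 3, 4])

# Population standard deviation: divides by N (NumPy's default, ddof=0).
population_std = np.std(a)

# Sample standard deviation: divides by N - 1.
sample_std = np.std(a, ddof=1)

print(population_std)
print(sample_std)
```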


5. <b>Matrix operations:</b> Here are some common matrix operations that you can perform using NumPy:

In [21]:
import numpy as np

# Create a 2D NumPy array (matrix)
A = np.array([[1, 2], [3, 4]])
print(A)  # Output: [[1 2]
            #          [3 4]]

# Create another 2D NumPy array (matrix)
B = np.array([[5, 6], [7, 8]])
print(B)  # Output: [[5 6]
            #          [7 8]]

# Matrix addition
C = np.add(A, B) # OR  C = A + B
print(C)  # Output: [[ 6  8]
            #          [10 12]]

# Matrix subtraction
D = np.subtract(A, B) # OR  D = A - B
print(D)  # Output: [[-4 -4]
            #          [-4 -4]]

# Matrix element-wise multiplication
E = np.multiply(A, B) # OR  E = A * B
print(E)  # Output: [[ 5 12]
            #          [21 32]]

# Matrix element-wise division (not matrix division!)
F = A / B
print(F)  # Output: [[ 0.2         0.33333333]
            #          [ 0.42857143  0.5       ]]

# Matrix transpose
G = A.T
print(G)  # Output: [[1 3]
            #          [2 4]]

# Matrix multiplication (dot product)
H = np.dot(A, B)
print(H)  # Output: [[19 22]
            #          [43 50]]

# Matrix inverse
I = np.linalg.inv(A)
print(I)  # Output: [[-2.   1. ]
            #          [ 1.5 -0.5]]

# Matrix determinant
J = np.linalg.det(A)
print(J)  # Output: -2.0000000000000004 (approximately -2, due to floating-point rounding)

[[1 2]
 [3 4]]
[[5 6]
 [7 8]]
[[ 6  8]
 [10 12]]
[[-4 -4]
 [-4 -4]]
[[ 5 12]
 [21 32]]
[[0.2        0.33333333]
 [0.42857143 0.5       ]]
[[1 3]
 [2 4]]
[[19 22]
 [43 50]]
[[-2.   1. ]
 [ 1.5 -0.5]]
-2.0000000000000004
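
Because the inverse and the determinant are computed in floating-point arithmetic, their results are only approximately exact, as the determinant above shows. A minimal sketch of how such results might be checked, using the `@` operator (equivalent to `np.dot` for 2D arrays) together with `np.allclose` and `np.isclose`:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

# The @ operator performs matrix multiplication, like np.dot for 2-D arrays.
H = A @ B
print(np.array_equal(H, np.dot(A, B)))

# Multiplying a matrix by its inverse should give the identity matrix,
# up to floating-point rounding, so compare with np.allclose rather than ==.
A_inv = np.linalg.inv(A)
print(np.allclose(A @ A_inv, np.eye(2)))

# The determinant is likewise only approximately -2.
print(np.isclose(np.linalg.det(A), -2.0))
```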


## 1.1.2 Pandas

<b>Pandas</b> is a Python library that provides data structures and tools for data analysis. It offers functions and methods for reading and writing data, handling missing values, filtering and sorting data, and performing statistical analysis.

In Pandas, there are two main types of data structures: Series and DataFrames.

A <b>Series</b> is a one-dimensional labeled array that can hold any data type. It is similar to a column in a spreadsheet or a field in a database table. A Series has a name, which is the label for the data, and a data type, which specifies the type of data that the Series contains.

A <b>DataFrame</b> is a two-dimensional labeled data structure with columns of potentially different types. You can think of a DataFrame as a table or a spreadsheet, with rows and columns. Each column in a DataFrame has a name and a data type, and each row has an index.

Here is an example of creating a Pandas Series and a DataFrame:

In [24]:
import pandas as pd

# Create a Series
s = pd.Series([1, 2, 3, 4], name='numbers')

# Create a DataFrame
df = pd.DataFrame({'numbers': [1, 2, 3, 4], 'letters': ['a', 'b', 'c', 'd']})
print("Series:")
print(s)

print("DataFrame:")
print(type(df))
print(df)

Series:
0    1
1    2
2    3
3    4
Name: numbers, dtype: int64
DataFrame:
<class 'pandas.core.frame.DataFrame'>
   numbers letters
0        1       a
1        2       b
2        3       c
3        4       d


In this example, the Series s has a name of 'numbers' and contains numerical data, and the DataFrame df has two columns: 'numbers' and 'letters', which contain numerical and string data, respectively.
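
The relationship between the two structures can be seen by selecting a single column from a DataFrame, which returns a Series. The following sketch rebuilds the same `df` as above and inspects it; nothing beyond standard Pandas attributes is assumed.

```python
import pandas as pd

df = pd.DataFrame({'numbers': [1, 2, 3, 4], 'letters': ['a', 'b', 'c', 'd']})

# Selecting one column returns a Series named after that column.
col = df['numbers']
print(type(col).__name__)
print(col.name)

# Each column carries its own data type.
print(df.dtypes)

# The shape is (rows, columns), and the default row index is 0..n-1.
print(df.shape)
print(list(df.index))
```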

Here are some examples of common Pandas operations:

1. <b>Reading and writing data:</b> You can use the read_csv function to read data from a CSV file into a Pandas DataFrame, and the to_csv method of a DataFrame to write it to a CSV file. For example:

In [33]:
import pandas as pd

# Read a CSV file into a DataFrame
df = pd.read_csv('student.csv')
df
# Write the DataFrame to a CSV file
# df.to_csv('output.csv')

Unnamed: 0,id,name,class,mark,gender
0,1,John Deo,Four,75.0,female
1,2,Max Ruin,Three,85.0,male
2,3,Arnold,Three,55.0,male
3,4,Krish Star,Four,60.0,female
4,5,John Mike,Four,60.0,female
5,6,Alex John,Four,55.0,male
6,7,My John Rob,Fifth,,male
7,8,Asruid,Five,85.0,male
8,9,Tes Qry,Six,78.0,male
9,10,Big John,Four,,female
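
Since `student.csv` may not be available to every reader, here is a self-contained round trip that writes a small DataFrame to a temporary file and reads it back. The columns and values are illustrative only. Passing `index=False` to `to_csv` leaves the row index out of the file; without it, the index is written as an extra unnamed column that `read_csv` then loads as `Unnamed: 0`.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3], 'mark': [75.0, None, 55.0]})

# Write to a temporary file; index=False leaves the row index out of the CSV.
path = os.path.join(tempfile.mkdtemp(), 'output.csv')
df.to_csv(path, index=False)

# Read it back: the empty field is parsed as NaN.
restored = pd.read_csv(path)
print(restored)
print(restored.equals(df))
```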


2. <b>Handling missing values:</b> You can use the isnull and notnull functions to identify missing values in a DataFrame, and the fillna function to fill in missing values. For example:

In [34]:
import pandas as pd

# Read a CSV file into a DataFrame
df = pd.read_csv('student.csv')

# Select the rows that contain at least one missing value
missing = df[df.isnull().any(axis=1)]

# Fill in missing values
df = df.fillna(0)
df

Unnamed: 0,id,name,class,mark,gender
0,1,John Deo,Four,75.0,female
1,2,Max Ruin,Three,85.0,male
2,3,Arnold,Three,55.0,male
3,4,Krish Star,Four,60.0,female
4,5,John Mike,Four,60.0,female
5,6,Alex John,Four,55.0,male
6,7,My John Rob,Fifth,0.0,male
7,8,Asruid,Five,85.0,male
8,9,Tes Qry,Six,78.0,male
9,10,Big John,Four,0.0,female
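
Before filling anything in, it usually helps to see where the missing values are. The sketch below uses a three-row excerpt of the table above (an assumption made so that the example runs without `student.csv`) to count missing values per column, select the affected rows, and fill a single column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['John Deo', 'My John Rob', 'Big John'],
    'mark': [75.0, np.nan, np.nan],
})

# Count missing values per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Select only the rows that have at least one missing value.
rows_with_missing = df[df.isnull().any(axis=1)]
print(rows_with_missing)

# Fill only the 'mark' column, leaving other columns untouched.
filled = df.fillna({'mark': 0})
print(filled)
```

Filling marks with 0 is convenient for a demonstration, but it also changes statistics such as the mean, so the fill value should be chosen with the later analysis in mind.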


3. <b>Viewing the first rows:</b> The head() method returns the first five rows of a DataFrame by default. You can pass a number to choose how many rows to return.

In [79]:
df.head(2)

Unnamed: 0,id,name,class,mark,gender
0,1,John Deo,Four,75.0,female
1,2,Max Ruin,Three,85.0,male


4. <b>Slicing:</b> Slicing selects a range of rows from the dataset by position. The end position is excluded, so df[1:2] returns only the row at position 1.

In [80]:
df[1:2]

Unnamed: 0,id,name,class,mark,gender
1,2,Max Ruin,Three,85.0,male
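
For more explicit row selection, Pandas offers `iloc` (by position) and `loc` (by label). With the default 0..n-1 index the two look alike, so the sketch below uses a hypothetical index of 10, 20, 30 to make the difference visible.

```python
import pandas as pd

df = pd.DataFrame(
    {'name': ['John Deo', 'Max Ruin', 'Arnold'], 'mark': [75.0, 85.0, 55.0]},
    index=[10, 20, 30],  # non-default labels to make the difference visible
)

# iloc is position-based; the end position is excluded.
by_position = df.iloc[1:2]

# loc is label-based; the end label is included.
by_label = df.loc[10:20]

print(by_position)
print(by_label)
```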


5. <b>Filtering and sorting data:</b> You can filter a DataFrame by placing a boolean condition inside square brackets, which keeps only the matching rows, and sort it by one or more columns with the sort_values method. Note that the where method does not drop rows: it keeps every row and replaces the non-matching ones with NaN. For example:

In [41]:
import pandas as pd

# Read a CSV file into a DataFrame
df = pd.read_csv('student.csv')

# Keep only the rows where the mark is above 70
filtered = df[df['mark'] > 70]

# Sort the DataFrame by a column (rows with a missing mark are placed last)
sorted_df = df.sort_values('mark')
sorted_df

Unnamed: 0,id,name,class,mark,gender
2,3,Arnold,Three,55.0,male
5,6,Alex John,Four,55.0,male
3,4,Krish Star,Four,60.0,female
4,5,John Mike,Four,60.0,female
0,1,John Deo,Four,75.0,female
8,9,Tes Qry,Six,78.0,male
1,2,Max Ruin,Three,85.0,male
7,8,Asruid,Five,85.0,male
6,7,My John Rob,Fifth,,male
9,10,Big John,Four,,female
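
The difference between `where` and boolean indexing is easy to miss, so here is a small side-by-side sketch on a four-row excerpt of the data (chosen for illustration). It also shows a descending sort.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['John Deo', 'Arnold', 'Asruid', 'Big John'],
    'mark': [75.0, 55.0, 85.0, np.nan],
})

# where keeps every row and replaces non-matching rows with NaN.
masked = df.where(df['mark'] > 70)

# Boolean indexing keeps only the rows where the condition is True.
filtered = df[df['mark'] > 70]

# Sort descending; rows with a missing mark go last by default.
ranked = df.sort_values('mark', ascending=False)

print(masked)
print(filtered)
print(ranked)
```

A comparison with NaN evaluates to False, so the row with a missing mark is excluded from `filtered` and blanked out in `masked`.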


6. <b>Statistical analysis:</b> Pandas provides a variety of functions and methods for performing statistical analysis on data. For example:

In [51]:
import pandas as pd

# Read a CSV file into a DataFrame
df = pd.read_csv('student.csv')

# Calculate the mean of a column
mean = df['mark'].mean()
print("mean", mean)

# Calculate the standard deviation of a column
std = df['mark'].std()
print("std", std)

# Calculate the correlation between two columns
corr = df['id'].corr(df['mark'])
print("corr", corr)

mean 69.125
std 12.999313168669445
corr 0.13757616629758387
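
These statistics are computed over the non-missing values only, because Pandas skips NaN by default. The sketch below reproduces the mean and standard deviation from the `mark` values shown in the tables above, without needing `student.csv`, and shows what happens when NaN values are not skipped.

```python
import numpy as np
import pandas as pd

# The 'mark' values shown in the table above, including the two missing ones.
marks = pd.Series([75, 85, 55, 60, 60, 55, np.nan, 85, 78, np.nan], name='mark')

# count() reports only the non-missing values.
print(marks.count())

# mean() and std() skip NaN by default (skipna=True).
print(marks.mean())
print(marks.std())

# With skipna=False a single NaN makes the result NaN.
print(marks.mean(skipna=False))

# describe() gathers the common summary statistics in one call.
print(marks.describe())
```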


# 1.2 Data preparation and cleaning methods

Data preparation and cleaning is a crucial step in the data analysis process. It involves getting the data into a form that is suitable for analysis and modeling. Here are some common techniques and methods used in data preparation and cleaning:

1. <b>Handling missing values:</b> Missing values can cause problems during analysis and modeling, so it is important to identify and handle them appropriately. One approach is to simply drop rows or columns with missing values. Another approach is to impute the missing values using methods such as mean imputation or median imputation.

In [71]:
import pandas as pd

# Load a dataset with missing values
df = pd.read_csv('student.csv')

# Drop rows that contain missing values
rows_dropped = df.dropna(axis=0)

# Alternatively, drop columns that contain missing values
cols_dropped = df.dropna(axis=1)
rows_dropped

Unnamed: 0,id,name,class,mark,gender
0,1,John Deo,Four,75.0,female
1,2,Max Ruin,Three,85.0,male
2,3,Arnold,Three,55.0,male
3,4,Krish Star,Four,60.0,female
4,5,John Mike,Four,60.0,female
5,6,Alex John,Four,55.0,male
7,8,Asruid,Five,85.0,male
8,9,Tes Qry,Six,78.0,male
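
Dropping rows discards information, so imputation is often preferred when only a few values are missing. A minimal sketch of mean and median imputation with `fillna`, on a hypothetical `mark` column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'mark': [75.0, 85.0, 55.0, np.nan, 60.0, np.nan]})

# Mean imputation: replace each missing value with the column mean.
mean_imputed = df['mark'].fillna(df['mark'].mean())

# Median imputation: replace each missing value with the column median,
# which is less sensitive to extreme values than the mean.
median_imputed = df['mark'].fillna(df['mark'].median())

print(mean_imputed.tolist())
print(median_imputed.tolist())
```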


2. <b>Handling outliers:</b> Outliers are extreme values that can have a significant impact on statistical analyses and modeling. There are several approaches to handling outliers, such as dropping them, transforming them, or binning them.
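
As an illustration, the sketch below flags outliers with the common 1.5 × IQR rule and then either drops them or clips them to the boundary values. The detection rule and the sample values are assumptions chosen for the example; other rules, such as z-score thresholds, are equally valid.

```python
import pandas as pd

values = pd.Series([52, 55, 58, 60, 61, 63, 65, 200])

# Compute the interquartile range (IQR) and the usual 1.5 * IQR fences.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Flag values that fall outside the fences.
is_outlier = (values < lower) | (values > upper)

# Option 1: drop the outliers.
dropped = values[~is_outlier]

# Option 2: transform them by clipping to the fence values.
clipped = values.clip(lower=lower, upper=upper)

print(values[is_outlier])
print(dropped.tolist())
print(clipped.tolist())
```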

3. <b>Scaling and normalization:</b> Scaling and normalization transform variables to a common scale. This matters for algorithms that rely on distance measures, such as k-means clustering or k-nearest neighbors, because a variable with a large range would otherwise dominate the distance.
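
Two common choices are min-max scaling, which maps each column to the range [0, 1], and standardization (z-score), which gives each column zero mean and unit standard deviation. A minimal sketch with plain Pandas arithmetic; the `age` column is hypothetical and added only to show two columns on different scales.

```python
import pandas as pd

df = pd.DataFrame({'mark': [55.0, 60.0, 75.0, 85.0], 'age': [9.0, 10.0, 9.0, 11.0]})

# Min-max scaling: (x - min) / (max - min), computed column by column.
min_max = (df - df.min()) / (df.max() - df.min())

# Standardization: (x - mean) / std, using the population std (ddof=0).
z_score = (df - df.mean()) / df.std(ddof=0)

print(min_max)
print(z_score)
```

In a modeling workflow the minimum, maximum, mean, and standard deviation would normally be computed on the training data only and then reused for new data.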

4. <b>Encoding categorical variables:</b> Categorical variables need to be encoded as numerical values before they can be used in most machine learning models. One way to do this is to use one-hot encoding, which creates a new binary column for each category.

In [63]:
import pandas as pd

# Load a dataset with missing values
df = pd.read_csv('student.csv')

# One-hot encode a categorical variable (dtype=int gives 0/1 instead of True/False)
df = pd.get_dummies(df, columns=['gender'], dtype=int)
df

Unnamed: 0,id,name,class,mark,gender_female,gender_male
0,1,John Deo,Four,75.0,1,0
1,2,Max Ruin,Three,85.0,0,1
2,3,Arnold,Three,55.0,0,1
3,4,Krish Star,Four,60.0,1,0
4,5,John Mike,Four,60.0,1,0
5,6,Alex John,Four,55.0,0,1
6,7,My John Rob,Fifth,,0,1
7,8,Asruid,Five,85.0,0,1
8,9,Tes Qry,Six,78.0,0,1
9,10,Big John,Four,,1,0
