<h1>Data Preprocessing</h1>
<p>Data preprocessing is the process of preparing and cleaning raw data before it is used for analysis or input into machine learning models. The goal is to transform the data into a format that is consistent, accurate, and ready for modeling. This involves handling missing values, correcting errors, normalizing data, encoding categorical variables, and removing outliers, among other steps.</p>
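<p>As a rough illustration of these steps, the sketch below fills a missing value, rescales a numeric column, and one-hot encodes a categorical column. The small DataFrame raw and its column names are made up for this example, and median imputation and min-max scaling are only one of several reasonable choices.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with one missing age (illustration only)
raw = pd.DataFrame({
    'Age': [28, np.nan, 40, 22],
    'City': ['Paris', 'Rome', 'Paris', 'Oslo'],
})

# 1. Handle missing values: fill the gap with the column median
clean = raw.copy()
clean['Age'] = clean['Age'].fillna(clean['Age'].median())

# 2. Normalize: rescale Age to the range [0, 1] (min-max scaling)
age_min, age_max = clean['Age'].min(), clean['Age'].max()
clean['Age_scaled'] = (clean['Age'] - age_min) / (age_max - age_min)

# 3. Encode the categorical column as one indicator column per category
clean = pd.get_dummies(clean, columns=['City'])

print(clean)
```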

In [1]:
import pandas as pd

<h2>Pandas</h2>
<p>Pandas is an open-source Python library used for data manipulation and analysis. It provides powerful data structures, such as DataFrame and Series, that allow efficient handling, cleaning, and processing of structured data.</p>

In [2]:
# Create a dictionary of sample data
data = {
    'Name': ['John', 'Sarah', 'Mike', 'Anna', 'David'],
    'Age': [28, 35, 40, 22, 45],
    'Income': [50000, 60000, 65000, 45000, 70000],
    'Marital Status': ['Single', 'Married', 'Married', 'Single', 'Divorced']
}

# Create a DataFrame
df = pd.DataFrame(data)

# Display the DataFrame
df

,Name,Age,Income,Marital Status
0,John,28,50000,Single
1,Sarah,35,60000,Married
2,Mike,40,65000,Married
3,Anna,22,45000,Single
4,David,45,70000,Divorced


<h3>In Pandas, a handful of basic methods and attributes cover most day-to-day inspection of a DataFrame. Some of the most common are:</h3>

In [3]:
# head(): Displays the first n rows of a DataFrame (5 by default) to get an overview of the data.

df.head(2)

,Name,Age,Income,Marital Status
0,John,28,50000,Single
1,Sarah,35,60000,Married


In [4]:
# tail(): Displays the last n rows of a DataFrame (5 by default).

df.tail(2)

,Name,Age,Income,Marital Status
3,Anna,22,45000,Single
4,David,45,70000,Divorced


In [5]:
# info(): Provides information about the DataFrame, including the data types, non-null counts, and memory usage.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Name            5 non-null      object
 1   Age             5 non-null      int64 
 2   Income          5 non-null      int64 
 3   Marital Status  5 non-null      object
dtypes: int64(2), object(2)
memory usage: 288.0+ bytes


In [6]:
# describe(): Generates summary statistics (like mean, standard deviation, min, and max) for numeric columns.

df.describe()

,Age,Income
count,5.0,5.0
mean,34.0,58000.0
std,9.192388,10368.220677
min,22.0,45000.0
25%,28.0,50000.0
50%,35.0,60000.0
75%,40.0,65000.0
max,45.0,70000.0
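<p>By default, describe() summarizes only the numeric columns, which is why Name and Marital Status are absent above. As a small sketch, passing include='all' adds count, unique, top, and freq rows for the text columns as well. The DataFrame is rebuilt here so the snippet runs on its own.</p>

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Sarah', 'Mike', 'Anna', 'David'],
    'Age': [28, 35, 40, 22, 45],
    'Income': [50000, 60000, 65000, 45000, 70000],
    'Marital Status': ['Single', 'Married', 'Married', 'Single', 'Divorced'],
})

# include='all' summarizes every column; statistics that do not apply
# to a column (for example, the mean of Name) are shown as NaN
summary = df.describe(include='all')
print(summary)
```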


In [7]:
# isnull(): Returns a Boolean mask of the same shape, with True wherever a value is missing (NaN or None).

df.isnull()

,Name,Age,Income,Marital Status
0,False,False,False,False
1,False,False,False,False
2,False,False,False,False
3,False,False,False,False
4,False,False,False,False
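<p>The sample data has no gaps, so every entry above is False. A Boolean mask becomes more useful once it is summarized. The sketch below uses a made-up DataFrame named people, with two deliberately missing entries, to count missing values per column and to pull out the affected rows.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing entries (illustration only)
people = pd.DataFrame({
    'Name': ['John', 'Sarah', 'Mike'],
    'Age': [28, np.nan, 40],
    'Income': [50000, 60000, np.nan],
})

# Number of missing values in each column
missing_per_column = people.isnull().sum()
print(missing_per_column)

# Rows that contain at least one missing value
rows_with_gaps = people[people.isnull().any(axis=1)]
print(rows_with_gaps)
```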


In [8]:
# columns: Returns the labels of the columns in a DataFrame.

df.columns

Index(['Name', 'Age', 'Income', 'Marital Status'], dtype='object')

In [9]:
# sample(): Randomly selects a specified number of rows from a DataFrame or Series.
# It is a method, so it must be called with parentheses; a bare df.sample only returns the bound method object.

df.sample(2)

# The two rows returned differ from run to run because they are drawn at random.
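<p>sample() draws different rows on each call unless a seed is supplied. As a brief sketch, the random_state parameter makes the draw repeatable, and frac selects a fraction of the rows instead of a fixed count. The seed value 42 is arbitrary.</p>

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Sarah', 'Mike', 'Anna', 'David'],
    'Age': [28, 35, 40, 22, 45],
})

# random_state fixes the seed, so repeated calls return the same rows
first = df.sample(n=2, random_state=42)
second = df.sample(n=2, random_state=42)
print(first.equals(second))

# frac=1 returns every row in shuffled order
shuffled = df.sample(frac=1, random_state=42)
print(shuffled)
```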

In [10]:
# index: Returns the labels of the rows (index) in a DataFrame or Series.

df.index

RangeIndex(start=0, stop=5, step=1)

In [11]:
# shape: Returns the dimensions of a DataFrame as a tuple (rows, columns).

df.shape

(5, 4)

In [12]:
# size: Returns the total number of elements in a DataFrame (rows × columns).

df.size

20

In [13]:
# dtypes: Returns the data types of each column in the DataFrame.

df.dtypes

Name              object
Age                int64
Income             int64
Marital Status    object
dtype: object
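<p>Once the column types are known, they can be changed with astype(). The sketch below converts Age to a floating-point type and Marital Status to the category type, which suits columns with a few repeated labels. The target types are chosen only for illustration.</p>

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [28, 35, 40, 22, 45],
    'Marital Status': ['Single', 'Married', 'Married', 'Single', 'Divorced'],
})

# astype() returns a new DataFrame; the original df is left unchanged
converted = df.astype({'Age': 'float64', 'Marital Status': 'category'})
print(converted.dtypes)
```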

In [14]:
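# values: Returns the underlying data as a NumPy array from a DataFrame or Series.
# With mixed column types the array has dtype object; the pandas documentation recommends to_numpy() for new code.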

df.values

array([['John', 28, 50000, 'Single'],
       ['Sarah', 35, 60000, 'Married'],
       ['Mike', 40, 65000, 'Married'],
       ['Anna', 22, 45000, 'Single'],
       ['David', 45, 70000, 'Divorced']], dtype=object)
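<p>The array above has dtype object because it mixes text and numbers. As a short sketch, selecting only the numeric columns before converting yields a numeric array, which is usually what numerical code expects. to_numpy() is used here in place of the values attribute.</p>

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Sarah', 'Mike', 'Anna', 'David'],
    'Age': [28, 35, 40, 22, 45],
    'Income': [50000, 60000, 65000, 45000, 70000],
})

# Converting only the numeric columns gives an integer array
numeric = df[['Age', 'Income']].to_numpy()
print(numeric.dtype, numeric.shape)

# Converting all columns forces the common type object
mixed = df.to_numpy()
print(mixed.dtype)
```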

In [15]:
# empty: Returns True if the DataFrame is empty (i.e., has no elements), otherwise False.

df.empty

False

In [16]:
# ndim: Returns the number of dimensions (axes) in a DataFrame or Series (1 for Series, 2 for DataFrame).

df.ndim

2
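<p>The shape, size, and ndim attributes are related, as the following sketch shows on a cut-down copy of the sample data: size equals rows times columns, and a single column (a Series) has one dimension fewer than the DataFrame it comes from.</p>

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [28, 35, 40],
    'Income': [50000, 60000, 65000],
})

rows, cols = df.shape

# size is the product of the two entries in shape
print(df.size == rows * cols)

# A DataFrame has two axes, a single column (Series) has one
print(df.ndim, df['Age'].ndim)

# The shape of a Series is a one-element tuple
print(df['Age'].shape)
```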

In [17]:
# T: Transposes the DataFrame (swaps rows and columns).

df.T

,0,1,2,3,4
Name,John,Sarah,Mike,Anna,David
Age,28,35,40,22,45
Income,50000,60000,65000,45000,70000
Marital Status,Single,Married,Married,Single,Divorced


<h3>Accessing an Individual Column</h3>

In [18]:
# Accessing an individual column with attribute (dot) notation

df.Age

0    28
1    35
2    40
3    22
4    45
Name: Age, dtype: int64

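Or, with bracket notation, which also works for column names that contain spaces, such as 'Marital Status':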

In [19]:
df['Age']

0    28
1    35
2    40
3    22
4    45
Name: Age, dtype: int64
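<p>Bracket notation covers two cases that dot notation cannot handle. The sketch below selects a column whose name contains a space and then selects several columns at once by passing a list, which returns a DataFrame rather than a Series.</p>

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Sarah', 'Mike', 'Anna', 'David'],
    'Income': [50000, 60000, 65000, 45000, 70000],
    'Marital Status': ['Single', 'Married', 'Married', 'Single', 'Divorced'],
})

# A column name with a space can only be reached with brackets
status = df['Marital Status']
print(type(status).__name__)

# A list of names selects several columns and returns a DataFrame
subset = df[['Name', 'Income']]
print(subset)
```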

In [20]:
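# Save the DataFrame to a CSV file. index=False leaves out the row index,
# which would otherwise come back as an extra 'Unnamed: 0' column when the file is read.
df.to_csv('example.csv', index=False)

In [21]:
# Read the CSV file back into a DataFrame
df = pd.read_csv('example.csv')

df

,Name,Age,Income,Marital Status
0,John,28,50000,Single
1,Sarah,35,60000,Married
2,Mike,40,65000,Married
3,Anna,22,45000,Single
4,David,45,70000,Divorced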
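<p>By default, to_csv() writes the row index as an unnamed first column, which read_csv() then loads as a column called 'Unnamed: 0'. The sketch below shows that behaviour and two ways around it. It uses an in-memory buffer (io.StringIO) instead of a file on disk, and the two-row DataFrame is a cut-down copy of the sample data.</p>

```python
import io

import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Sarah'], 'Age': [28, 35]})

# Default behaviour: the row index is written as an unnamed first column
with_index = df.to_csv()
leaked = pd.read_csv(io.StringIO(with_index))
print(leaked.columns.tolist())

# Option 1: leave the index out when writing
without_index = df.to_csv(index=False)
restored = pd.read_csv(io.StringIO(without_index))
print(restored.columns.tolist())

# Option 2: keep the index in the file and tell read_csv which column holds it
restored_idx = pd.read_csv(io.StringIO(with_index), index_col=0)
print(restored_idx.columns.tolist())
```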
