<div> <img src="everything-about-pandas.png" alt="Drawing" style="width: 600px;"/></div> 

### Introduction 👇

Pandas is a powerful and flexible open-source data analysis and manipulation tool built on top of the Python programming language. It provides easy-to-use data structures and data analysis tools for handling and manipulating numerical tables and time series data.

With pandas, you can load and manipulate datasets, perform aggregations and transformations, create visualizations, and perform advanced statistical analysis all within Python. It is an essential library for data scientists and analysts working with large datasets.

Some of the key features of pandas are:

- Fast and efficient DataFrame object for data manipulation with integrated indexing
- Tools for reading and writing data between in-memory data structures and different file formats
- Data alignment and handling of missing data
- Reshaping and pivoting of data sets 
- Label-based slicing, indexing, and subsetting of large data sets
- Size mutability: columns can be inserted into and deleted from data structures
- Group by engine allowing split-apply-combine operations on data sets
- Data set merging and joining
- Time-series functionality
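
A few of these features can be sketched on toy data. The `sales` and `regions` frames below are hypothetical and invented purely for illustration; the sketch shows split-apply-combine with `groupby`, joining with `merge`, reshaping with `pivot`, and label-based alignment.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
sales = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "revenue": [100, 120, 80, 90],
})
regions = pd.DataFrame({"store": ["A", "B"], "region": ["North", "South"]})

# Split-apply-combine: total revenue per store.
totals = sales.groupby("store")["revenue"].sum()

# Merging: attach each store's region to its sales rows.
merged = sales.merge(regions, on="store", how="left")

# Reshaping: one row per store, one column per month.
wide = sales.pivot(index="store", columns="month", values="revenue")

# Data alignment: arithmetic matches on index labels, not position;
# labels present in only one operand produce NaN.
s1 = pd.Series([1, 2], index=["a", "b"])
s2 = pd.Series([10, 20], index=["b", "c"])
aligned = s1 + s2

print(totals, merged, wide, aligned, sep="\n\n")
```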



### Data Preprocessing 👇
Data preprocessing is an important step in the data analysis process, as it ensures that the data is clean, correct, and ready for analysis. Here are the general steps involved in data preprocessing using pandas, with scikit-learn handling the last two steps:

- Load the data
- Explore the data
- Clean the data
- Transform the data
- Split the data
- Feature scaling

The sections below walk through each of these steps on a housing dataset.

In [1]:
# import necessary libraries
import numpy as np
import pandas as pd

##### 1) Load the data: ⭐ 
The first step is to load the data into a pandas DataFrame. This can be done using the read_csv function, which reads a comma-separated values (CSV) file into a DataFrame. You can also use the read_excel function to read an Excel file, or the read_sql function to read data from a SQL database.


In [2]:
housing = pd.read_csv("D:/Datasets/housing.csv")

In [3]:
# read_csv already returns a DataFrame; work on a copy so housing stays unchanged
df = housing.copy()

In [4]:
df.shape

(20640, 10)
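
The other readers mentioned above follow the same pattern. The sketch below is a minimal illustration, assuming a hypothetical three-row sample and a hypothetical table name `housing_sample` (neither comes from the housing file): `read_csv` reads from a file-like object, and `read_sql` queries an in-memory SQLite database. `read_excel` works similarly but needs an optional Excel engine installed, so it is not shown.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical three-row sample, invented for illustration; not the housing file.
csv_text = """longitude,latitude,median_house_value
-122.23,37.88,452600
-122.22,37.86,358500
-122.24,37.85,
"""

# read_csv accepts a path or any file-like object; the empty field becomes NaN.
from_csv = pd.read_csv(io.StringIO(csv_text))

# read_sql runs a query against a database connection (in-memory SQLite here).
conn = sqlite3.connect(":memory:")
from_csv.to_sql("housing_sample", conn, index=False)
from_sql = pd.read_sql("SELECT * FROM housing_sample WHERE latitude > 37.85", conn)
conn.close()

print(from_csv.shape)
print(from_sql)
```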


##### 2) Explore the data:  ⭐
Once the data is loaded, it is important to explore it and get a sense of its structure and content. You can use the head and tail methods to view the first and last few rows, and the describe method to compute basic summary statistics for the numeric columns. The info method and the dtypes attribute report the data type of each column; info also shows non-null counts and memory usage.


In [5]:
columns = housing.columns
print(columns)

Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'median_house_value', 'ocean_proximity'],
      dtype='object')


In [6]:
df.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


In [7]:
df.describe()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| count | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20433.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 |
| mean | -119.569704 | 35.631861 | 28.639486 | 2635.763081 | 537.870553 | 1425.476744 | 499.53968 | 3.870671 | 206855.816909 |
| std | 2.003532 | 2.135952 | 12.585558 | 2181.615252 | 421.38507 | 1132.462122 | 382.329753 | 1.899822 | 115395.615874 |
| min | -124.35 | 32.54 | 1.0 | 2.0 | 1.0 | 3.0 | 1.0 | 0.4999 | 14999.0 |
| 25% | -121.8 | 33.93 | 18.0 | 1447.75 | 296.0 | 787.0 | 280.0 | 2.5634 | 119600.0 |
| 50% | -118.49 | 34.26 | 29.0 | 2127.0 | 435.0 | 1166.0 | 409.0 | 3.5348 | 179700.0 |
| 75% | -118.01 | 37.71 | 37.0 | 3148.0 | 647.0 | 1725.0 | 605.0 | 4.74325 | 264725.0 |
| max | -114.31 | 41.95 | 52.0 | 39320.0 | 6445.0 | 35682.0 | 6082.0 | 15.0001 | 500001.0 |


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB


In [9]:
df.dtypes

longitude             float64
latitude              float64
housing_median_age    float64
total_rooms           float64
total_bedrooms        float64
population            float64
households            float64
median_income         float64
median_house_value    float64
ocean_proximity        object
dtype: object
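
By default, describe() on a mixed DataFrame summarizes only the numeric columns, which is why ocean_proximity is missing from its output above. A minimal sketch of how a text column could be explored, assuming a hypothetical five-row `toy` frame in place of the housing data:

```python
import pandas as pd

# Hypothetical toy frame standing in for the housing data.
toy = pd.DataFrame({
    "median_income": [8.3, 7.2, 5.6, 3.8, 2.1],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "INLAND", "INLAND"],
})

# tail() shows the last rows, complementing head().
print(toy.tail(2))

# value_counts() tallies each category, most frequent first.
counts = toy["ocean_proximity"].value_counts()
print(counts)

# describe() on a text column reports count, unique, top, and freq.
text_summary = toy["ocean_proximity"].describe()
print(text_summary)
```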

##### 3) Clean the data: ⭐
The next step is to clean the data by handling missing values and ensuring that the data is in a consistent format. You can use the isnull and notnull methods (also available as isna and notna) to identify missing values, and the fillna method to replace them, for example with the column median as done below. You can also use the apply method to run a custom function along the rows or columns of a DataFrame, or on each value of a Series.


In [10]:
df.isna().sum()

longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64

In [11]:
med = df["total_bedrooms"].median()
med

435.0

In [12]:
df['total_bedrooms'] = df['total_bedrooms'].fillna(med)

In [13]:
df.isna().sum()

longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
median_house_value    0
ocean_proximity       0
dtype: int64
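
The notnull/notna and apply methods mentioned at the start of this step are not used in the cells above. Here is a minimal sketch of both, assuming a hypothetical `toy` frame with one missing value and inconsistently formatted labels (the messy labels are invented; the housing data does not necessarily contain any):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value and inconsistent text.
toy = pd.DataFrame({
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
    "ocean_proximity": [" near bay", "NEAR BAY ", "Inland", "INLAND"],
})

# notna() is the inverse of isna(); use it as a mask to keep complete rows.
complete_rows = toy[toy["total_bedrooms"].notna()]

# Fill the gap with the column median (computed from the non-missing values).
median = toy["total_bedrooms"].median()
toy["total_bedrooms"] = toy["total_bedrooms"].fillna(median)

# Series.apply calls the function once per value; here it normalizes labels.
toy["ocean_proximity"] = toy["ocean_proximity"].apply(lambda s: s.strip().upper())

print(complete_rows.shape)
print(toy)
```

Dropping incomplete rows and filling them are alternatives; filling keeps every row at the cost of introducing estimated values.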

##### 4) Transform the data: ⭐
After the data is cleaned, you may need to transform it by adding or removing columns, or by changing the data types of the columns. You can use the assign method to add new columns, the drop method to remove columns, and the astype method to change the data types of the columns.
- One-hot encoding

    One-hot encoding represents a categorical variable as numerical data. It creates a new binary column for each unique category in the variable. Each row has a 1 in the column for the category it belongs to, and a 0 in all other columns.

In [14]:
# dtype=int gives 0/1 columns; pandas 2.x would otherwise return True/False
ocean_prox = pd.get_dummies(df['ocean_proximity'], dtype=int)

In [15]:
ocean_prox.head()

| | <1H OCEAN | INLAND | ISLAND | NEAR BAY | NEAR OCEAN |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | 0 | 0 | 0 | 1 | 0 |
| 3 | 0 | 0 | 0 | 1 | 0 |
| 4 | 0 | 0 | 0 | 1 | 0 |


In [16]:
df = pd.concat([df, ocean_prox], axis=1)

In [17]:
df = df.drop(columns=['ocean_proximity'])
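
The encode, concatenate, and drop steps above can also be collapsed into a single get_dummies call on the whole DataFrame. This is a sketch on a hypothetical three-row `toy` frame, not a replacement for the cells above; note that the resulting column names carry an `ocean_proximity_` prefix by default, unlike the bare category names produced earlier.

```python
import pandas as pd

# Hypothetical toy frame standing in for the housing data.
toy = pd.DataFrame({
    "median_income": [8.3, 5.6, 3.8],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY"],
})

# columns= selects what to encode; the original column is dropped automatically.
# dtype=int gives 0/1 values instead of booleans.
encoded = pd.get_dummies(toy, columns=["ocean_proximity"], dtype=int)
print(encoded)
```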

In [18]:
df.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | <1H OCEAN | INLAND | ISLAND | NEAR BAY | NEAR OCEAN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | 0 | 0 | 0 | 1 | 0 |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | 0 | 0 | 0 | 1 | 0 |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | 0 | 0 | 0 | 1 | 0 |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | 0 | 0 | 0 | 1 | 0 |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | 0 | 0 | 0 | 1 | 0 |


In [19]:
df.shape

(20640, 14)
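
The assign and astype methods named at the start of this step are not exercised in the cells above. A minimal sketch of both follows, using a hypothetical `toy` frame; the derived column `rooms_per_household` is an invented example feature, not one used in this walkthrough.

```python
import pandas as pd

# Hypothetical toy frame with a few of the housing columns.
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0],
    "households": [126.0, 1138.0, 177.0],
    "housing_median_age": [41.0, 21.0, 52.0],
})

# assign returns a new frame with an extra column computed from existing ones.
toy = toy.assign(rooms_per_household=toy["total_rooms"] / toy["households"])

# astype converts a column's dtype; these ages are whole numbers stored as floats.
toy["housing_median_age"] = toy["housing_median_age"].astype("int64")

print(toy.dtypes)
```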


##### 5) Split the data: ⭐
If you are planning to build a machine learning model, you will need to split the data into a training set and a test set. You can use the train_test_split function from the sklearn.model_selection module to split the data into these two sets.


In [20]:
#import necessary libraries
from sklearn.model_selection import train_test_split

In [21]:
y = df['median_house_value'].values
X = df.drop(columns=['median_house_value']).values

In [22]:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=20)
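
As a variation on the cell above, train_test_split also accepts DataFrames and Series directly, which keeps column names and index labels on the resulting pieces. The sketch below assumes hypothetical synthetic data (100 random rows, invented for illustration) and shows the 80/20 row counts produced by test_size=0.2.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data: 100 rows, two features, one target.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "median_income": rng.uniform(0.5, 15.0, size=100),
    "housing_median_age": rng.integers(1, 53, size=100),
})
toy["median_house_value"] = 40_000 * toy["median_income"] + rng.normal(0, 10_000, size=100)

X_df = toy.drop(columns=["median_house_value"])
y_sr = toy["median_house_value"]

# Passing DataFrames/Series (without .values) keeps column names and index labels.
X_tr, X_te, y_tr, y_te = train_test_split(X_df, y_sr, test_size=0.2, random_state=20)

print(X_tr.shape, X_te.shape)
```

Fixing random_state makes the shuffle reproducible, so the same rows land in each set on every run.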


##### 6) Feature scaling: ⭐
Many machine learning algorithms perform better when the input features are on comparable scales. You can use the StandardScaler class from the sklearn.preprocessing module to standardize each feature to zero mean and unit variance. Standardization rescales the values but does not change the shape of their distribution.

In [23]:
#import necessary libraries
from sklearn.preprocessing import StandardScaler

In [24]:
scaler = StandardScaler()

In [25]:
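# learn the mean and standard deviation from the training set only
x_train = scaler.fit_transform(x_train)
# reuse the training statistics on the test set; refitting here would scale
# the two sets differently and leak test-set information into preprocessing
x_test = scaler.transform(x_test)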
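
To make the arithmetic concrete, here is a NumPy-only sketch of the standardization that StandardScaler performs, z = (x - mean) / std, with both statistics taken from the training data. The arrays `train` and `test` are hypothetical values invented for illustration; this is an illustration of the formula, not the library's implementation.

```python
import numpy as np

# Hypothetical toy arrays: three training rows and one test row, two features.
train = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
test = np.array([[2.0, 500.0]])

# Statistics come from the training data only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # same mean and std as the training set

print(train_scaled.mean(axis=0))  # approximately 0 for each feature
print(train_scaled.std(axis=0))   # approximately 1 for each feature
print(test_scaled)
```

The scaled test row generally does not have zero mean or unit variance itself, which is expected: it is measured against the training distribution.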

### Summary
Pandas is a powerful data manipulation library in Python that allows you to clean, transform, and analyze data efficiently.
Data preprocessing involves loading, exploring, cleaning, transforming, splitting, and scaling the data to prepare it for analysis.