# Data cleaning

In [None]:
import pandas as pd
import numpy as np
housing = pd.read_csv('housing.csv')

Remove unnecessary columns

In [None]:
# "column_name" is a placeholder for the column(s) to drop
housing = housing.drop(["column_name"], axis=1)

Separate numerical and categorical data

In [None]:
housing_cat = housing[["ocean_proximity"]]
housing_target = housing["median_house_value"]
housing_num = housing.drop(["ocean_proximity","median_house_value"], axis=1)

In [None]:
print(housing_num.head())
print(housing_cat.head())
print(housing_target.head())

## Handling missing values

Manually filling missing values in a column with the median value

In [None]:
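# Assign the result back: calling fillna(inplace=True) on a selected column
# is deprecated in recent pandas versions and may not update the DataFrame
median = housing["total_bedrooms"].median()
housing["total_bedrooms"] = housing["total_bedrooms"].fillna(median)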
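
As a quick check of the manual approach, here is a minimal sketch on a toy column. The frame `toy` and its values are made up for illustration and are not part of the housing data.

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values) with one missing entry
toy = pd.DataFrame({"total_bedrooms": [2.0, np.nan, 4.0, 9.0]})

# median() skips NaN, so it is computed from 2, 4 and 9
median = toy["total_bedrooms"].median()

# Fill the gap and assign the result back to the column
toy["total_bedrooms"] = toy["total_bedrooms"].fillna(median)
print(toy["total_bedrooms"].tolist())
```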

### SimpleImputer 
- Scikit-Learn provides a handy class to take care of missing values: SimpleImputer.
- Create a SimpleImputer instance, specifying that we want to replace each attribute’s missing values with the median of that attribute.
- The median can only be computed on numerical attributes, so we need a copy of the data without the text attribute (the categorical data).

In [None]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")

- Now we can fit the imputer instance to the training data using the fit() method.
- We cannot be sure that there won’t be any missing values in new data after the system goes live, so it is safer to apply the imputer to all the numerical attributes.

In [None]:
imputer.fit(housing_num)

The imputer has simply computed the median of each attribute and stored the result in its statistics_ attribute.

In [None]:
imputer.statistics_
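
To confirm what `statistics_` holds, the sketch below compares it with the per-column medians computed by pandas. The frame `toy_num` and its values are hypothetical and stand in for `housing_num`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy numerical data (hypothetical values) with one missing entry per column
toy_num = pd.DataFrame({
    "total_rooms": [10.0, 20.0, np.nan, 40.0],
    "total_bedrooms": [1.0, np.nan, 3.0, 5.0],
})

toy_imputer = SimpleImputer(strategy="median")
toy_imputer.fit(toy_num)

# The learned statistics are the per-column medians, computed ignoring NaN
learned = toy_imputer.statistics_
expected = toy_num.median().values
print(learned)
print(np.allclose(learned, expected))
```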

- Now we can use this “trained” imputer to transform the training set, replacing missing values with the learned medians.

In [None]:
X = imputer.transform(housing_num)
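
To illustrate why the imputer is fitted on all numerical attributes, here is a small sketch in which the new data has a gap in a column that was complete during training. The frames `train` and `new_data` and their values are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical training data: only total_bedrooms has a missing value
train = pd.DataFrame({
    "total_rooms": [100.0, 200.0, 300.0],
    "total_bedrooms": [10.0, np.nan, 30.0],
})

# Hypothetical new data: this time total_rooms is the column with a gap
new_data = pd.DataFrame({
    "total_rooms": [np.nan, 150.0],
    "total_bedrooms": [12.0, 18.0],
})

imputer_all = SimpleImputer(strategy="median")
imputer_all.fit(train)  # learns a median for every column

# The gap in total_rooms is filled with the training median of that column
filled = imputer_all.transform(new_data)
print(filled)
```

Because a median was learned for every column at fit time, the imputer can handle a missing value wherever it appears later.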

- The result is a plain NumPy array containing the transformed features. To put it back into a pandas DataFrame, use pd.DataFrame.

In [None]:
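# Pass the original index as well so the rows stay aligned with housing_num
housing_tr = pd.DataFrame(X, columns=housing_num.columns, index=housing_num.index)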

The imputer is a Scikit-Learn estimator, so it has a fit() method.

It is also a transformer, so it has a transform() method.

For transformers, the fit_transform() method fits and transforms the data in a single call.
**fit_transform()** is equivalent to calling fit() and then transform(), and some transformers implement it in an optimized way.

In [None]:
X = imputer.fit_transform(housing_num)
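
The equivalence between the two-step and one-step calls can be checked on a toy array. This is a minimal sketch; the array `data` and its values are made up.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy array (hypothetical values) with missing entries
data = np.array([[1.0, np.nan], [3.0, 4.0], [np.nan, 8.0]])

# Two-step version: fit() returns the estimator, so transform() can be chained
two_step = SimpleImputer(strategy="median").fit(data).transform(data)

# One-step version
one_step = SimpleImputer(strategy="median").fit_transform(data)

print(np.array_equal(two_step, one_step))
```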

## Handling categorical data

- Ordinal encoding
- One-hot encoding

In [None]:
from sklearn.preprocessing import OrdinalEncoder
ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
housing_cat_encoded[:10]

In [None]:
ordinal_encoder.categories_

In [None]:
from sklearn.preprocessing import OneHotEncoder
cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)

In [None]:
housing_cat_1hot.toarray()

In [None]:
housing_cat_1hot_df = pd.DataFrame(housing_cat_1hot.toarray(), columns=cat_encoder.get_feature_names_out(["ocean_proximity"]))

In [None]:
housing_cat_1hot_df.describe()

## Series vs DataFrame in pandas: `df["column"]` vs `df[["column"]]`

In pandas, indexing with single versus double square brackets returns different types, and this is by design.

When you use single square brackets `df["column"]`, you are indexing a single column, and the result is a Pandas Series. A Series is essentially a one-dimensional labeled array, and it retains the index of the original DataFrame.

On the other hand, when you use double square brackets `df[["column"]]`, you are indexing with a list of column names, even if there is only one column in the list. This syntax is designed to return a DataFrame with the specified column(s). The result is a DataFrame with one or more columns, and it retains the DataFrame structure.

Here's a simple example to illustrate:

```python
import pandas as pd

# Creating a DataFrame
data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)

# Single square brackets return a Series
series_result = df['A']
print(type(series_result))  # <class 'pandas.core.series.Series'>

# Double square brackets return a DataFrame
df_result = df[['A']]
print(type(df_result))  # <class 'pandas.core.frame.DataFrame'>
```

Both forms give access to the same data through standard pandas operations. Which one to use depends on what the consumer expects: Scikit-Learn encoders and imputers expect 2-D input, which is why `housing_cat` above is built with `housing[["ocean_proximity"]]`, while a 1-D target such as `housing["median_house_value"]` is usually kept as a Series.
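
One practical consequence is that Scikit-Learn transformers expect 2-D input. The sketch below uses a toy frame (`df_demo`, hypothetical rows) to show the shapes of the two forms and how `OrdinalEncoder` reacts to each.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df_demo = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND"]})

as_series = df_demo["ocean_proximity"]    # 1-D Series
as_frame = df_demo[["ocean_proximity"]]   # 2-D DataFrame with one column
print(as_series.shape, as_frame.shape)

encoder = OrdinalEncoder()
encoded = encoder.fit_transform(as_frame)  # accepted: input is 2-D

try:
    encoder.fit_transform(as_series)       # 1-D input is rejected
    series_ok = True
except ValueError:
    series_ok = False
print(series_ok)
```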