
### Data Cleaning

In [5]:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Name": ["Avik","Fahad","Sizan",np.nan],
    "Age": [25,24,np.nan,24],
    "Occupation": ["Banker","Data Scientist",".Net Developer","System Architect"]
})

In [6]:
# Fill Sizan's missing age.
# Assign with .loc in a single step. Chained assignment (df.Age[2] = 24) triggers
# a chained-assignment warning and, under Copy-on-Write (the default in pandas 3.0),
# no longer updates the original DataFrame.
# See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df.loc[2, "Age"] = 24


In [7]:
df

Unnamed: 0,Name,Age,Occupation
0,Avik,25.0,Banker
1,Fahad,24.0,Data Scientist
2,Sizan,24.0,.Net Developer
3,,24.0,System Architect


In [9]:
# Drop missing data
df.dropna(inplace=True)
df

Unnamed: 0,Name,Age,Occupation
0,Avik,25.0,Banker
1,Fahad,24.0,Data Scientist
2,Sizan,24.0,.Net Developer
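
Dropping rows also discards the valid `Age` and `Occupation` values in the last row. A minimal sketch of an alternative, assuming imputation is acceptable for the analysis: fill the numeric column with its median and the text column with a placeholder. The names `df_raw` and `df_filled` and the placeholder `"Unknown"` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Same toy data as above, rebuilt so the sketch is self-contained.
df_raw = pd.DataFrame({
    "Name": ["Avik", "Fahad", "Sizan", np.nan],
    "Age": [25, 24, np.nan, 24],
    "Occupation": ["Banker", "Data Scientist", ".Net Developer", "System Architect"],
})

# fillna with a dict applies a different fill value to each column.
df_filled = df_raw.fillna({
    "Age": df_raw["Age"].median(),  # median of the observed ages
    "Name": "Unknown",              # hypothetical placeholder label
})

print(df_filled)
```

Whether to drop or impute depends on how much data can be spared and on how the filled values would affect later steps.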


### Data Cleaning: Another Example

In [11]:
import pandas as pd
from scipy import stats

df = pd.DataFrame({
    "Transaction": [100,150,200,250,300,9999]
})

In [12]:
df['z_score'] = stats.zscore(df['Transaction'])
df

Unnamed: 0,Transaction,z_score
0,100,-0.474523
1,150,-0.460833
2,200,-0.447144
3,250,-0.433454
4,300,-0.419765
5,9999,2.235719
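
With only six observations, a population z-score cannot exceed sqrt(n - 1), about 2.24, in absolute value, so the common |z| > 3 rule would not flag 9999 here. A minimal sketch of the interquartile-range (IQR) rule as an alternative; the 1.5 multiplier is the conventional choice and is an assumption here, as are the variable names.

```python
import pandas as pd

transactions = pd.Series([100, 150, 200, 250, 300, 9999], name="Transaction")

# Quartiles and interquartile range of the sample.
q1 = transactions.quantile(0.25)
q3 = transactions.quantile(0.75)
iqr = q3 - q1

# Fences: anything outside [lower, upper] is treated as an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = transactions[(transactions < lower) | (transactions > upper)]
print(outliers.tolist())
```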


In [14]:
# Note: with six values, limits=0.05 caps zero values per tail, so nothing changes here.
df['winsorized_amount'] = stats.mstats.winsorize(df["Transaction"], limits=0.05)
df

Unnamed: 0,Transaction,z_score,winsorized_amount
0,100,-0.474523,100
1,150,-0.460833,150
2,200,-0.447144,200
3,250,-0.433454,250
4,300,-0.419765,300
5,9999,2.235719,9999


In [15]:
# Truncate: keep only the transactions at or below the threshold.
threshold = 500

df_truncate = df[df["Transaction"] <= threshold]
df_truncate

Unnamed: 0,Transaction,z_score,winsorized_amount
0,100,-0.474523,100
1,150,-0.460833,150
2,200,-0.447144,200
3,250,-0.433454,250
4,300,-0.419765,300
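
For winsorizing to cap 9999 in this six-row sample, the upper limit has to cover at least one value. A sketch under that assumption, using `limits=(0, 0.2)` so that only the largest value is replaced by the next largest; the limits and the names `df_w` and `capped` are chosen for this toy data and are not from the original notebook.

```python
import numpy as np
import pandas as pd
from scipy.stats import mstats

df_w = pd.DataFrame({"Transaction": [100, 150, 200, 250, 300, 9999]})

# limits=(lower, upper): fraction of values to cap in each tail.
# With 6 values, 0.2 caps int(6 * 0.2) = 1 value at the top and none at the bottom.
capped = mstats.winsorize(df_w["Transaction"].to_numpy(), limits=(0, 0.2))

# winsorize returns a masked array; convert it to a plain ndarray before storing.
df_w["winsorized_amount"] = np.asarray(capped)
print(df_w)
```

Unlike truncation, this keeps all six rows and only changes the extreme value.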


### Data Transformation

### Label Encoder

In [16]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"Rating": ['Low','Medium','High','Medium','Low']})

In [17]:
le = LabelEncoder()

In [18]:
df['Encoded_rating'] = le.fit_transform(df['Rating'])

In [19]:
df

Unnamed: 0,Rating,Encoded_rating
0,Low,1
1,Medium,2
2,High,0
3,Medium,2
4,Low,1
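
`LabelEncoder` assigns codes in sorted order of the labels (High=0, Low=1, Medium=2), which does not reflect the Low < Medium < High ranking. A minimal sketch that keeps the ranking, assuming the order is meaningful for the downstream model. It uses an ordered pandas categorical; the names `df_rating`, `rating_order`, and `Ordinal_rating` are illustrative.

```python
import pandas as pd

df_rating = pd.DataFrame({"Rating": ["Low", "Medium", "High", "Medium", "Low"]})

# Declare the category order explicitly; the integer codes follow this order.
rating_order = ["Low", "Medium", "High"]
ordered = pd.Categorical(df_rating["Rating"], categories=rating_order, ordered=True)

df_rating["Ordinal_rating"] = ordered.codes
print(df_rating)
```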


In [20]:
pd.get_dummies(df['Rating'],drop_first=False)

Unnamed: 0,High,Low,Medium
0,False,True,False
1,False,False,True
2,True,False,False
3,False,False,True
4,False,True,False
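
The three indicator columns above always sum to one per row, so one of them is redundant for models that include an intercept. A sketch using `drop_first=True`, which drops the first category (here `High`); `dtype=int` is added as an assumption that 0/1 values are preferred over booleans.

```python
import pandas as pd

ratings = pd.Series(["Low", "Medium", "High", "Medium", "Low"], name="Rating")

# drop_first=True removes the first (alphabetically sorted) category column.
dummies = pd.get_dummies(ratings, drop_first=True, dtype=int)
print(dummies)
```

A row with zeros in both remaining columns then stands for the dropped category.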


### Numerical Data Scaling

In [21]:
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"Value":[10,20,30,40,50]})

In [23]:
scaler = MinMaxScaler()

# MinMaxScaler expects 2D input; ravel() flattens the (n, 1) result back to one column.
df['Scaled_Value'] = scaler.fit_transform(df[['Value']]).ravel()
df

Unnamed: 0,Value,Scaled_Value
0,10,0.0
1,20,0.25
2,30,0.5
3,40,0.75
4,50,1.0
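
The scaled values follow the min-max formula (x - min) / (max - min). A short sketch that computes it by hand and compares the result with `MinMaxScaler`, assuming the default feature range of 0 to 1; the names `values`, `manual`, and `sk_scaled` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

values = pd.DataFrame({"Value": [10, 20, 30, 40, 50]})

# Manual min-max scaling: (x - min) / (max - min)
col = values["Value"]
manual = (col - col.min()) / (col.max() - col.min())

# The same result from scikit-learn, flattened to one dimension.
sk_scaled = MinMaxScaler().fit_transform(values[["Value"]]).ravel()

print(np.allclose(manual, sk_scaled))
```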


### Feature Engineering

In [24]:
df = pd.DataFrame({"Age":[22,34,55,46,63]})
df

Unnamed: 0,Age
0,22
1,34
2,55
3,46
4,63


In [25]:
bins = [0,30,50,100]
labels = ['Young','Middle-age','Senior']

In [27]:
df['bin_age'] = pd.cut(df['Age'],bins = bins, labels = labels)
df

Unnamed: 0,Age,bin_age
0,22,Young
1,34,Middle-age
2,55,Senior
3,46,Middle-age
4,63,Senior
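
By default `pd.cut` builds right-closed intervals, so an age exactly on a bin edge falls into the lower bin. A small sketch of how `right=False` changes that, using two edge values (30 and 50) that are not in the original data.

```python
import pandas as pd

ages = pd.Series([30, 50])  # values sitting exactly on the bin edges
bins = [0, 30, 50, 100]
labels = ["Young", "Middle-age", "Senior"]

# Default: intervals (0, 30], (30, 50], (50, 100]
right_closed = pd.cut(ages, bins=bins, labels=labels)

# right=False: intervals [0, 30), [30, 50), [50, 100)
left_closed = pd.cut(ages, bins=bins, labels=labels, right=False)

print(right_closed.tolist())
print(left_closed.tolist())
```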


In [28]:
from sklearn.preprocessing import PolynomialFeatures

df = pd.DataFrame({"size":[550,700,900],"rooms":[2,3,4]})

In [29]:
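# degree=2 adds squared and pairwise interaction terms; include_bias=False omits the constant column.
poly_f = PolynomialFeatures(degree=2, include_bias=False)

# Wrap the transformed array in a DataFrame with the generated feature names.
df_poly = pd.DataFrame(poly_f.fit_transform(df), columns=poly_f.get_feature_names_out(df.columns))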

In [30]:
df

Unnamed: 0,size,rooms
0,550,2
1,700,3
2,900,4


In [31]:
df_poly

Unnamed: 0,size,rooms,size^2,size rooms,rooms^2
0,550.0,2.0,302500.0,1100.0,4.0
1,700.0,3.0,490000.0,2100.0,9.0
2,900.0,4.0,810000.0,3600.0,16.0


### Imbalanced Dataset

In [32]:
from sklearn.utils import resample

df = pd.DataFrame({"class":[0,0,1,0,0,1,0],"feature":[1,2,3,4,5,6,7]})

In [33]:
df_majority = df[df['class'] == 0]
df_majority

Unnamed: 0,class,feature
0,0,1
1,0,2
3,0,4
4,0,5
6,0,7


In [34]:
df_minority = df[df['class'] == 1]
df_minority

Unnamed: 0,class,feature
2,1,3
5,1,6


In [36]:
# Undersample the majority class (without replacement) down to the minority class size.
df_undersample = resample(df_majority, replace=False, n_samples=len(df_minority), random_state=42)
df_undersample

Unnamed: 0,class,feature
1,0,2
6,0,7


In [38]:
balanced_data = pd.concat([df_minority, df_undersample])

In [39]:
balanced_data

Unnamed: 0,class,feature
2,1,3
5,1,6
1,0,2
6,0,7
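
Undersampling discards three of the five majority rows. A minimal sketch of the opposite approach, assuming duplicated minority rows are acceptable: oversample the minority class with replacement until it matches the majority size. The names `data`, `majority`, `minority`, `minority_over`, and `balanced_over` are illustrative.

```python
import pandas as pd
from sklearn.utils import resample

data = pd.DataFrame({"class": [0, 0, 1, 0, 0, 1, 0], "feature": [1, 2, 3, 4, 5, 6, 7]})
majority = data[data["class"] == 0]
minority = data[data["class"] == 1]

# Sample the minority class with replacement until it matches the majority size.
minority_over = resample(minority, replace=True, n_samples=len(majority), random_state=42)

balanced_over = pd.concat([majority, minority_over])
print(balanced_over["class"].value_counts())
```

Oversampling keeps every original row but repeats minority rows, so it should be applied only to training data to avoid leaking duplicates into evaluation.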
