### Dealing with Mixed Variables

- Mixed variables are those whose values contain both numbers and labels. For example, a vehicle registration number such as MH12DK 8765 contains both letters and digits.
- One way to engineer such a feature is to extract the categorical part into one variable and the numerical part into another, which yields two variables.
- We then apply to each new variable the same feature engineering techniques we would use for any other categorical or numerical feature.
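
As a minimal sketch of this split, assume a hypothetical `cabin` column whose values are a letter prefix followed by digits. The column name and its values are illustrative only and are not part of the data used in the scenarios.

```python
import pandas as pd

# Hypothetical mixed variable: a letter prefix followed by digits.
cabin = pd.Series(['C85', 'E46', 'B5', 'C123'], name='cabin')

# Categorical part: the leading letters.
cabin_label = cabin.str.extract(r'^([A-Za-z]+)', expand=False)

# Numerical part: the trailing digits, converted from string to a number.
cabin_num = pd.to_numeric(cabin.str.extract(r'(\d+)$', expand=False))

parts = pd.DataFrame({'cabin_label': cabin_label, 'cabin_num': cabin_num})
print(parts)
```

Each resulting column can then be handled like any other categorical or numerical feature.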


### Scenarios

#### Scenario-1
Numbers and text (i.e. labels) appear in different observations.

In [1]:
import numpy as np
import pandas as pd

In [72]:
df = pd.DataFrame({'Rollno': np.arange(1, 11),
                   'Grade': ['B', 'C', 'A', 1, 2, 3, 'B', 'A', 2, 1]})
df.head()

| | Rollno | Grade |
|---|---|---|
| 0 | 1 | B |
| 1 | 2 | C |
| 2 | 3 | A |
| 3 | 4 | 1 |
| 4 | 5 | 2 |


In [73]:
# errors='coerce' sets values that cannot be parsed as numbers to NaN;
# the nullable 'Int64' dtype then stores the parsed values as integers.
df['Grade_num'] = pd.to_numeric(df['Grade'], errors='coerce').astype('Int64')
df.head()

| | Rollno | Grade | Grade_num |
|---|---|---|---|
| 0 | 1 | B | |
| 1 | 2 | C | |
| 2 | 3 | A | |
| 3 | 4 | 1 | 1 |
| 4 | 5 | 2 | 2 |


In [74]:
df['Grade_label'] = np.where(df['Grade_num'].isna(), df['Grade'],np.nan)
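
`np.where` returns a plain NumPy array. A pandas-native sketch of the same split uses `Series.where`, which keeps a value where the condition holds and writes NaN elsewhere. The final check, that each observation fills exactly one of the two new columns, is added here for illustration.

```python
import pandas as pd

grade = pd.Series(['B', 'C', 'A', 1, 2, 3, 'B', 'A', 2, 1], name='Grade')

# Numeric part: unparseable values become missing, the rest become nullable integers.
grade_num = pd.to_numeric(grade, errors='coerce').astype('Int64')

# Label part: keep the original value only where no number was parsed.
grade_label = grade.where(grade_num.isna())

# Each observation should populate exactly one of the two new columns.
exactly_one = grade_num.notna() ^ grade_label.notna()
print(exactly_one.all())
```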

In [75]:
df.head(6)

| | Rollno | Grade | Grade_num | Grade_label |
|---|---|---|---|---|
| 0 | 1 | B | | B |
| 1 | 2 | C | | C |
| 2 | 3 | A | | A |
| 3 | 4 | 1 | 1 | |
| 4 | 5 | 2 | 2 | |
| 5 | 6 | 3 | 3 | |


In [76]:
df.isna().sum()

Rollno         0
Grade          0
Grade_num      5
Grade_label    5
dtype: int64

#### Scenario-2
Labels and numbers appear in the same observation.

In [77]:
df = pd.DataFrame({'Company': ['Maruti', 'Hyundai', 'Honda', 'Tata', 'Toyota'],
                   'VehicleNum': ['MH12EK1123', 'KA12HZ4144', 'UP34AB2876', 'DL01AB1234', 'PN12CD3344']})

df.head()

| | Company | VehicleNum |
|---|---|---|
| 0 | Maruti | MH12EK1123 |
| 1 | Hyundai | KA12HZ4144 |
| 2 | Honda | UP34AB2876 |
| 3 | Tata | DL01AB1234 |
| 4 | Toyota | PN12CD3344 |


In [78]:
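# Named groups: state code (2 letters), city code (2 digits), letter series, trailing digits.
pattern = r"(?P<Veh_StCode>[A-Z]{2})(?P<Veh_CtyCode>\d{2})(?P<Veh_label>[A-Z]+)(?P<Veh_digits>\d+)"
veh_df = df.VehicleNum.str.extractall(pattern).reset_index()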
veh_df

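| | level_0 | match | Veh_StCode | Veh_CtyCode | Veh_label | Veh_digits |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | MH | 12 | EK | 1123 |
| 1 | 1 | 0 | KA | 12 | HZ | 4144 |
| 2 | 2 | 0 | UP | 34 | AB | 2876 |
| 3 | 3 | 0 | DL | 01 | AB | 1234 |
| 4 | 4 | 0 | PN | 12 | CD | 3344 |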


In [79]:
veh_df.drop(columns = ["level_0", "match"], inplace=True)
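
`str.extractall` returns every captured group as a string, so `Veh_digits` has to be cast before it can be used as a number. The `df.dtypes` output further down reports it as an integer, but the cell that performs the cast is not shown. The sketch below assumes a plain `astype(int)` and rebuilds a two-row `veh_df` so that it runs on its own.

```python
import pandas as pd

# Two-row stand-in for the registration numbers used in this scenario.
vehicle_num = pd.Series(['MH12EK1123', 'DL01AB1234'], name='VehicleNum')
pattern = r"(?P<Veh_StCode>[A-Z]{2})(?P<Veh_CtyCode>\d{2})(?P<Veh_label>[A-Z]+)(?P<Veh_digits>\d+)"
veh_df = vehicle_num.str.extractall(pattern).reset_index(drop=True)

# Every extracted group is a string at this point.
print(veh_df.dtypes)

# Assumed step: cast only the trailing digits to an integer dtype.
# Veh_CtyCode stays a string so that the leading zero in '01' is kept.
veh_df['Veh_digits'] = veh_df['Veh_digits'].astype(int)
print(veh_df.dtypes)
```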

In [82]:
df = df.merge(veh_df, left_index=True, right_index=True)
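
Because each registration number matches the pattern exactly once, the extract, drop and merge steps can also be written with `str.extract`, which returns one row per input row and keeps the original index. This is a sketch of that alternative on a two-row stand-in frame, not the approach used in the cells above.

```python
import pandas as pd

df = pd.DataFrame({'Company': ['Maruti', 'Tata'],
                   'VehicleNum': ['MH12EK1123', 'DL01AB1234']})
pattern = r"(?P<Veh_StCode>[A-Z]{2})(?P<Veh_CtyCode>\d{2})(?P<Veh_label>[A-Z]+)(?P<Veh_digits>\d+)"

# str.extract keeps df's index, so no reset_index or helper-column cleanup is needed.
parts = df['VehicleNum'].str.extract(pattern)

# join aligns on the index; rows that fail to match would carry NaN in every new column.
df = df.join(parts)
print(df)
```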

In [83]:
df

| | Company | VehicleNum | Veh_StCode | Veh_CtyCode | Veh_label | Veh_digits |
|---|---|---|---|---|---|---|
| 0 | Maruti | MH12EK1123 | MH | 12 | EK | 1123 |
| 1 | Hyundai | KA12HZ4144 | KA | 12 | HZ | 4144 |
| 2 | Honda | UP34AB2876 | UP | 34 | AB | 2876 |
| 3 | Tata | DL01AB1234 | DL | 01 | AB | 1234 |
| 4 | Toyota | PN12CD3344 | PN | 12 | CD | 3344 |


In [85]:
df.drop(columns=["VehicleNum"], inplace=True)

In [96]:
to_convert = ['Veh_StCode', 'Veh_CtyCode','Veh_label']

df[to_convert] = df[to_convert].astype('category')
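
Casting to `category` stores each distinct label once and represents the column internally as integer codes. As a sketch of what that gives, using a stand-in Series rather than the full frame:

```python
import pandas as pd

# Stand-in for the Veh_label column built above.
veh_label = pd.Series(['EK', 'HZ', 'AB', 'AB', 'CD'], name='Veh_label').astype('category')

# Distinct labels, sorted lexically by default.
print(veh_label.cat.categories.tolist())

# Integer code of each row's label; equal labels share a code.
print(veh_label.cat.codes.tolist())
```

The codes follow the sort order of the labels, so they carry no meaning beyond identifying the category.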

In [97]:
df.dtypes

Company          object
Veh_StCode     category
Veh_CtyCode    category
Veh_label      category
Veh_digits        int32
dtype: object

In [98]:
df

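| | Company | Veh_StCode | Veh_CtyCode | Veh_label | Veh_digits |
|---|---|---|---|---|---|
| 0 | Maruti | MH | 12 | EK | 1123 |
| 1 | Hyundai | KA | 12 | HZ | 4144 |
| 2 | Honda | UP | 34 | AB | 2876 |
| 3 | Tata | DL | 01 | AB | 1234 |
| 4 | Toyota | PN | 12 | CD | 3344 |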
