## 1 - Set Up Environment

In [1]:
import numpy as np
import pandas as pd
import pickle

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## 2 - Load Dataframe

Now we load the dataframe that we saved in the previous section.

In [2]:
with open('/content/drive/MyDrive/Python/Regression/Assets/df(1.4).pickle', 'rb') as file:
    df = pickle.load(file)

df.head(3)

| | Year | Month | Week Day | Duration | Cost | Team Member | Height | Frequency | Signal Strength | Antenna Type | Orientation | Power Supply | Zone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019 | 3 | 0 | 241.0 | 516773.0 | 12.0 | 24.0 | Very Low Frequencies (VLF) | 3 | Dielectric | Omni-directional | Solar-powered | North |
| 1 | 2019 | 10 | 2 | 608.0 | 954888.0 | 22.0 | 42.0 | Very Low Frequencies (VLF) | 3 | Dielectric | Circular2 | Active | Center |
| 2 | 2019 | 6 | 5 | 772.0 | 932640.0 | 14.0 | 43.0 | Very High Frequencies (VHF) | 3 | NaN | Horizontal | Active | Center |


## 3 - Declare Variable

In this section, I focus only on the text categorical variables and clean and preprocess them. The text categorical variables in this dataset are:<br>
<li> Frequency
<li> Antenna Type
<li> Orientation
<li> Power Supply
<li> Zone

In [3]:
CatTxt_list = ['Frequency' , 'Antenna Type', 'Orientation', 'Power Supply', 'Zone']
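The list above is written by hand. As a cross-check, the candidate text columns can also be derived from the dataframe itself. The following is only a sketch: `text_columns` and the `demo` frame are hypothetical names introduced for illustration, and the result should still be compared against the data documentation, since any non-numeric column (for example a datetime column) would be returned as well.

```python
import pandas as pd

def text_columns(frame):
    """Return the names of all columns that are not numeric."""
    return [col for col in frame.columns
            if not pd.api.types.is_numeric_dtype(frame[col])]

# Small illustrative frame (not the real dataset)
demo = pd.DataFrame({
    'Year': [2019, 2019],
    'Cost': [516773.0, 954888.0],
    'Frequency': ['Very Low Frequencies (VLF)', 'Very High Frequencies (VHF)'],
    'Zone': ['North', 'Center'],
})

print(text_columns(demo))
```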

## 4 - Data Type Conversion

The first thing to check is the data type: each of these variables must be stored as a string.

In [4]:
df.dtypes

Year                 int64
Month                int64
Week Day             int64
Duration           float64
Cost               float64
Team Member        float64
Height             float64
Frequency           object
Signal Strength      int64
Antenna Type        object
Orientation         object
Power Supply        object
Zone                object
dtype: object

All five variables are reported as object, so we cast them to string explicitly to guarantee that every value is a string.

In [5]:
for col in CatTxt_list:
    df[col] = df[col].astype(str)

## 5 - Handle Null Values

First, we check whether these variables contain null values.

In [6]:
def null_checker():
    # Print the index of the rows with a null value, one line per text categorical column
    for col in CatTxt_list:
        print(col, ':', df.loc[df[col].isnull()].index)

Then we run the function for all text categorical variables.

In [7]:
null_checker()

Frequency : Int64Index([], dtype='int64')
Antenna Type : Int64Index([], dtype='int64')
Orientation : Int64Index([], dtype='int64')
Power Supply : Int64Index([], dtype='int64')
Zone : Int64Index([], dtype='int64')


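The result shows no null values. Note, however, that the string conversion in section 4 has already turned any missing entries into the literal string 'nan', so they no longer register as null. They appear among the categories in section 6 and are handled there.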
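Because the columns were cast to strings before this check, missing entries can hide as the text 'nan'. A minimal sketch of a check that counts both real nulls and such placeholder strings is given here. The helper name `count_missing`, its `placeholders` parameter, and the `demo` frame are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def count_missing(frame, columns, placeholders=('nan',)):
    """Count real nulls plus placeholder strings in each column."""
    counts = {}
    for col in columns:
        is_missing = frame[col].isna() | frame[col].isin(placeholders)
        counts[col] = int(is_missing.sum())
    return counts

# Illustrative frame: one real null and one literal 'nan' string
demo = pd.DataFrame({'Zone': ['North', np.nan, 'nan', 'West']})

print(count_missing(demo, ['Zone']))
```

Running a check like this before and after the type conversion would give the same count in both cases, which makes it harder to overlook missing values.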

## 6 - Possible Range

According to the data documentation, these variables can only contain the following values:<br>
<li> <b>Antenna type:</b> <i>Wire, Aperture, Reflector, Array, Printed Circuit Board (PCB), Dielectric</i>
<li> <b>Orientation:</b> <i>Horizontal, Vertical, Circular, Omni-directional</i>
<li> <b>Frequency:</b> <i>Very Low Frequencies (VLF), Low Frequencies (LF), Medium Frequencies (MF), High Frequencies (HF), Very High Frequencies (VHF), Ultra-High Frequencies (UHF), Super-High Frequencies (SHF), Extremely High Frequencies (EHF)</i>
<li> <b>Power Supply:</b> <i>Passive, Active, Power over Ethernet, Solar-powered, Battery-powered</i>
<li> <b>Zone:</b> <i>North, South, West, East, Center</i>
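The allowed values listed above can also be written down as data and compared with the dataframe automatically. This is a sketch under stated assumptions: the `ALLOWED` dictionary (shown for two of the five variables only), the helper `unexpected_values`, and the `demo` frame are hypothetical and serve only to illustrate the idea.

```python
import pandas as pd

# Allowed categories taken from the list above (two variables shown)
ALLOWED = {
    'Orientation': {'Horizontal', 'Vertical', 'Circular', 'Omni-directional'},
    'Zone': {'North', 'South', 'West', 'East', 'Center'},
}

def unexpected_values(frame, allowed):
    """Return, per column, the sorted values that are not in the allowed set."""
    report = {}
    for column, valid in allowed.items():
        found = set(frame[column].unique()) - valid
        if found:
            report[column] = sorted(found)
    return report

# Illustrative frame with one typo and one 'nan' placeholder
demo = pd.DataFrame({
    'Orientation': ['Horizontal', 'Circular2', 'Vertical'],
    'Zone': ['North', 'nan', 'Center'],
})

print(unexpected_values(demo, ALLOWED))
```

An empty report would mean every value in the checked columns is one of the documented categories.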

In [8]:
def possible_cate(x):
    print(x, ':', sorted(df[x].unique()))
    print('\n')
    print(df[x].value_counts())

### 6.1 - Frequency

In [9]:
possible_cate('Frequency')

Frequency : ['Extremely High Frequencies (EHF)', 'High Frequencies (HF)', 'Low Frequencies (LF)', 'Medium Frequencies (MF)', 'Super-High Frequencies (SHF)', 'Ultra-High Frequencies (UHF)', 'Very High Frequencies (VHF)', 'Very Low Frequencies (VLF)']


High Frequencies (HF)               56
Super-High Frequencies (SHF)        44
Very Low Frequencies (VLF)          39
Low Frequencies (LF)                36
Ultra-High Frequencies (UHF)        34
Medium Frequencies (MF)             33
Very High Frequencies (VHF)         31
Extremely High Frequencies (EHF)    26
Name: Frequency, dtype: int64


### 6.2 - Antenna Type

In [10]:
possible_cate('Antenna Type')

Antenna Type : ['Aperture', 'Array', 'Dielectric', 'Dielectric &8', 'Dielectric ..', 'Diiielectric', 'Printed Circuit Board (PCB)', 'Reflector', 'Wire', 'nan', 'synab']


Printed Circuit Board (PCB)    59
Aperture                       57
Reflector                      44
Array                          41
Dielectric                     37
Wire                           34
nan                            23
Dielectric &8                   1
Dielectric ..                   1
synab                           1
Diiielectric                    1
Name: Antenna Type, dtype: int64


Some values are not in the list of acceptable categories, so we need to handle them. Several are clearly misspellings of one specific category, and we can correct those directly.

In [11]:
replacements_Antenna_Type = {
    'Dielectric &8': 'Dielectric', 
    'Dielectric ..': 'Dielectric', 
    'Diiielectric': 'Dielectric'}

df['Antenna Type'] = df['Antenna Type'].replace(replacements_Antenna_Type)
possible_cate('Antenna Type')

Antenna Type : ['Aperture', 'Array', 'Dielectric', 'Printed Circuit Board (PCB)', 'Reflector', 'Wire', 'nan', 'synab']


Printed Circuit Board (PCB)    59
Aperture                       57
Reflector                      44
Array                          41
Dielectric                     40
Wire                           34
nan                            23
synab                           1
Name: Antenna Type, dtype: int64


The remaining invalid values ("nan" and "synab") cannot be mapped to a specific category, so we impute them with the mode, which is "Printed Circuit Board (PCB)".

In [12]:
replacements_Antenna_Type_other = {
    'nan': 'Printed Circuit Board (PCB)', 
    'synab': 'Printed Circuit Board (PCB)'}

df['Antenna Type'] = df['Antenna Type'].replace(replacements_Antenna_Type_other)
possible_cate('Antenna Type')

Antenna Type : ['Aperture', 'Array', 'Dielectric', 'Printed Circuit Board (PCB)', 'Reflector', 'Wire']


Printed Circuit Board (PCB)    83
Aperture                       57
Reflector                      44
Array                          41
Dielectric                     40
Wire                           34
Name: Antenna Type, dtype: int64
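In the cell above the mode is typed in by hand. A possible alternative is to compute it from the valid values and replace everything else in one step. The sketch below assumes a hypothetical helper `impute_with_mode` and a small `demo` series. It is not the notebook's own method, only an illustration of the same mode imputation.

```python
import pandas as pd

ALLOWED_ANTENNA = ['Wire', 'Aperture', 'Reflector', 'Array',
                   'Printed Circuit Board (PCB)', 'Dielectric']

def impute_with_mode(series, allowed_values):
    """Replace every value outside allowed_values with the mode of the valid values."""
    valid = series.isin(allowed_values)
    mode_value = series[valid].mode().iloc[0]  # first mode if there are ties
    return series.where(valid, mode_value)

# Illustrative data: two invalid entries ('nan' and 'synab')
demo = pd.Series(['Printed Circuit Board (PCB)', 'Aperture', 'nan',
                  'Printed Circuit Board (PCB)', 'synab'])

cleaned = impute_with_mode(demo, ALLOWED_ANTENNA)
print(cleaned.tolist())
```

Computing the mode only over valid values avoids the case where a placeholder such as 'nan' happens to be the most frequent entry.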


### 6.3 - Orientation

In [13]:
possible_cate('Orientation')

Orientation : ['Circular', 'Circular2', 'Horizontal', 'Horizontal&8', 'Horizontal2', 'Omni-directional', 'Vertical']


Horizontal          81
Omni-directional    75
Vertical            70
Circular            69
Circular2            2
Horizontal2          1
Horizontal&8         1
Name: Orientation, dtype: int64


Here too, some values are not in the list of acceptable categories. They are clearly misspellings of one specific category, so we can correct them directly.

In [14]:
replacements_Orientation = {
    'Circular2': 'Circular', 
    'Horizontal2': 'Horizontal', 
    'Horizontal&8': 'Horizontal'}

df['Orientation'] = df['Orientation'].replace(replacements_Orientation)
possible_cate('Orientation')

Orientation : ['Circular', 'Horizontal', 'Omni-directional', 'Vertical']


Horizontal          83
Omni-directional    75
Circular            71
Vertical            70
Name: Orientation, dtype: int64
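The replacement dictionaries for the misspelled values were built by hand. For larger datasets, candidate corrections could be suggested with fuzzy string matching from Python's standard `difflib` module. This is a sketch, not the approach used in this notebook: the helper `suggest_replacements` and the `cutoff` of 0.8 are assumptions, and the suggestions should be reviewed before they are applied.

```python
from difflib import get_close_matches

ALLOWED_ORIENTATIONS = ['Horizontal', 'Vertical', 'Circular', 'Omni-directional']

def suggest_replacements(observed, allowed, cutoff=0.8):
    """Map each unexpected value to its closest allowed category, if one is similar enough."""
    suggestions = {}
    for value in observed:
        if value in allowed:
            continue  # already a valid category
        match = get_close_matches(value, allowed, n=1, cutoff=cutoff)
        if match:
            suggestions[value] = match[0]
    return suggestions

observed = ['Circular', 'Circular2', 'Horizontal2', 'Horizontal&8', 'nan']
suggestions = suggest_replacements(observed, ALLOWED_ORIENTATIONS)
print(suggestions)
```

Values that are not similar to any category, such as 'nan', get no suggestion and still need a separate decision, for example imputation with the mode.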


### 6.4 - Power Supply

In [15]:
possible_cate('Power Supply')

Power Supply : ['Active', 'Battery-powered', 'Passive', 'PoE', 'Solar-powered', 'nan']


Battery-powered    72
Active             56
PoE                55
Passive            54
Solar-powered      49
nan                13
Name: Power Supply, dtype: int64


The string "nan" marks missing values, so we impute it with the mode, which is "Battery-powered".

In [16]:
replacements_Power_Supply = {
    'nan': 'Battery-powered'}

df['Power Supply'] = df['Power Supply'].replace(replacements_Power_Supply)
possible_cate('Power Supply')

Power Supply : ['Active', 'Battery-powered', 'Passive', 'PoE', 'Solar-powered']


Battery-powered    85
Active             56
PoE                55
Passive            54
Solar-powered      49
Name: Power Supply, dtype: int64


### 6.5 - Zone

In [17]:
possible_cate('Zone')

Zone : ['Center', 'East', 'North', 'South', 'West', 'nan']


North     64
West      61
East      61
Center    56
South     53
nan        4
Name: Zone, dtype: int64


The string "nan" marks missing values, so we impute it with the mode, which is "North".

In [18]:
replacements_Zone = {
    'nan': 'North'}

df['Zone'] = df['Zone'].replace(replacements_Zone)
possible_cate('Zone')

Zone : ['Center', 'East', 'North', 'South', 'West']


North     68
West      61
East      61
Center    56
South     53
Name: Zone, dtype: int64


## 7 - Check Point

In [19]:
with open('/content/drive/MyDrive/Python/Regression/Assets/df(1.Final).pickle', 'wb') as file:
    pickle.dump(df, file)