## 1 - Setup Environment

First, we import the core libraries. Other libraries may be imported later in the project as needed.

In [19]:
import numpy as np
import pandas as pd
import pickle

from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## 2 - Load Dataset

Next, we load the dataset. Make sure the file path matches where the CSV is stored on your Drive.

In [20]:
dataset = pd.read_csv('/content/drive/MyDrive/Python/Regression/Assets/dataset.csv')
dataset.head(3)

Unnamed: 0,Project ID,Team member,Start Date,Duration(day),Cost(K),height,type,orientation,Frequency,Signal strength (dbi),Power supply,zone
0,BTS_BoGSyT_001,12,3/4/2019,241.0,"$516,773",24,Dielectric,Omni-directional,Very Low Frequencies (VLF),3,Solar-powered,North
1,BTS_BoGSyT_002,22,10/2/2019,608.0,"$954,888",42,Dielectric,Circular2,Very Low Frequencies (VLF),3,Active,Center
2,BTS_BoGSyT_003,^24,3/24/2017,,"$1,944,529",27,Array,Circular,Very Low Frequencies (VLF),5,Solar-powered,North
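Since a wrong path is the most likely failure at this step, here is a minimal sketch of checking the path before reading. The helper name load_csv_checked and the temporary demo file are hypothetical and only for illustration; they are not part of the project.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_csv_checked(path):
    """Read a CSV, failing early with a clear message if the file is missing."""
    csv_path = Path(path)
    if not csv_path.is_file():
        raise FileNotFoundError(f"No CSV file found at {csv_path}")
    return pd.read_csv(csv_path)


# Demo on a throwaway file (hypothetical content, not the real dataset)
demo_path = Path(tempfile.mkdtemp()) / "demo.csv"
demo_path.write_text("Project ID,zone\nBTS_BoGSyT_001,North\n")

demo = load_csv_checked(demo_path)
print(demo.shape)
```

In the notebook, the same helper could be called with the Drive path used above.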


## 3 - Check Data Type

Let's check the data type of each variable.

In [21]:
print(dataset.dtypes)

Project ID               object
Team member              object
Start Date               object
Duration(day)            object
 Cost(K)                 object
height                   object
type                     object
orientation              object
Frequency                object
Signal strength (dbi)    object
Power supply             object
zone                     object
dtype: object


The result shows that every variable is stored as object, which is wrong for most of them, so we will have to convert them during the project. The desired data types are:<br>
<li><i>Project ID</i>: string [Text]
<li><i>Team member</i>: integer [Numerical]
<li><i>Start Date</i>: datetime [Date]
<li><i>Duration(day)</i>: integer [Numerical]
<li><i>Cost(K)</i>: integer [Numerical]
<li><i>height</i>: integer [Numerical]
<li><i>type</i>: string [Categorical ~ Nominal]
<li><i>orientation</i>: string [Categorical ~ Nominal]
<li><i>Frequency</i>: string [Categorical ~ Ordinal]
<li><i>Signal strength (dbi)</i>: integer [Categorical ~ Ordinal]
<li><i>Power supply</i>: string [Categorical ~ Nominal]
<li><i>zone</i>: string [Categorical ~ Nominal]

If you want to know more about the data, see the data documentation file.
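To give a feel for what these conversions involve, here is a rough sketch on three values copied from the preview above. It is one possible approach, not necessarily the one used later in the project. The helper to_number is hypothetical, and two things are assumptions: that stray characters such as "$", "," and "^" can simply be stripped, and that dates are written month/day/year.

```python
import pandas as pd

# Three raw values per column, copied from dataset.head(3)
sample = pd.DataFrame({
    "Team member": ["12", "22", "^24"],
    "Start Date": ["3/4/2019", "10/2/2019", "3/24/2017"],
    " Cost(K) ": ["$516,773", "$954,888", "$1,944,529"],
})


def to_number(series):
    """Strip everything except digits and dots, then parse as a number.

    Assumption: characters like '$', ',' and '^' are noise, not information.
    Values that still cannot be parsed become missing (NaN).
    """
    cleaned = series.astype(str).str.replace(r"[^0-9.]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")


converted = pd.DataFrame({
    # Nullable Int64 keeps integers even if some values are missing
    "Team member": to_number(sample["Team member"]).astype("Int64"),
    # Assumption: month/day/year format
    "Start Date": pd.to_datetime(sample["Start Date"], format="%m/%d/%Y"),
    "Cost": to_number(sample[" Cost(K) "]).astype("Int64"),
})

print(converted.dtypes)
```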

## 4 - Make a Copy

To preserve the original dataset, we first make a copy and work on that.

In [22]:
df = dataset.copy()
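The reason for calling copy() instead of writing df = dataset is that plain assignment only creates a second name for the same DataFrame. A small sketch on toy data (the frame and its values are made up for illustration):

```python
import pandas as pd

original = pd.DataFrame({"a": [1, 2, 3]})

alias = original               # same object, just another name
independent = original.copy()  # separate object with its own data

alias.loc[0, "a"] = 99         # this also changes `original`
independent.loc[1, "a"] = -1   # this leaves `original` untouched

print(original["a"].tolist())
print(alias is original, independent is original)
```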

## 5 - Duplicate Rows

In this step, we make sure there are no fully duplicated rows and no two rows that share the same Project ID. First, let's look at the data dimensions before removing duplicated Project IDs:

In [23]:
df.shape

(342, 12)

Now let's drop every row whose Project ID has already appeared, keeping the first occurrence. Fully duplicated rows share a Project ID too, so this also removes them:

In [24]:
# Keep only the first row for each Project ID
df = df[~df.duplicated(subset='Project ID')]

In [25]:
df.shape

(333, 12)

The result shows that 9 rows had a repeated Project ID. After removing them, df has 333 rows instead of 342.
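To see the difference between fully duplicated rows and rows that only repeat the Project ID, here is a small sketch on toy data. The IDs and costs are made up; only duplicated and drop_duplicates are real pandas calls.

```python
import pandas as pd

toy = pd.DataFrame({
    "Project ID": ["P1", "P2", "P2", "P3", "P3"],
    "Cost": [10, 20, 20, 30, 35],
})

# Rows identical to an earlier row in every column (here: the second P2)
full_dupes = toy.duplicated().sum()

# Rows whose Project ID already appeared, whatever the other columns hold
id_dupes = toy.duplicated(subset="Project ID").sum()

# Keep the first row per Project ID, same effect as the boolean mask above
deduped = toy.drop_duplicates(subset="Project ID", keep="first")

print(full_dupes, id_dupes, deduped.shape)
```

Note that the second P3 row has a different cost, so keeping the first occurrence is a choice about which record to trust.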

## 6 - Un-needed Variable

Project ID is just a label, so it cannot improve the model's accuracy. To avoid unnecessary complexity, we remove this variable.

In [26]:
df = df.drop(columns=['Project ID'])
print(df.shape)

(333, 11)


## 7 - Change Order and Name

To get a better view of the data, let's change the order and names of the variables. This makes them easier to preprocess and clean.

Let's list all the variable names:

In [27]:
print(df.columns)

Index(['Team member', 'Start Date', 'Duration(day)', ' Cost(K) ', 'height',
       'type', 'orientation', 'Frequency', 'Signal strength (dbi)',
       'Power supply', 'zone'],
      dtype='object')


Then we rename them to the names used in the data documentation.

In [28]:
col_name = [
    'Team Member',
    'Start Date',
    'Duration',
    'Cost',
    'Height',
    'Antenna Type',
    'Orientation',
    'Frequency',
    'Signal Strength',
    'Power Supply',
    'Zone',
]

df.columns = col_name

print(df.columns)

Index(['Team Member', 'Start Date', 'Duration', 'Cost', 'Height',
       'Antenna Type', 'Orientation', 'Frequency', 'Signal Strength',
       'Power Supply', 'Zone'],
      dtype='object')
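Assigning a list to df.columns relies on the list being in exactly the same order as the existing columns. As an alternative, here is a sketch that renames by name using a mapping, shown on an empty toy frame with the same raw column names. The names toy, rename_map and not_found are only for illustration. Since rename ignores keys it cannot find by default, the sketch checks for them explicitly.

```python
import pandas as pd

raw_columns = ['Team member', 'Start Date', 'Duration(day)', ' Cost(K) ',
               'height', 'type', 'orientation', 'Frequency',
               'Signal strength (dbi)', 'Power supply', 'zone']
toy = pd.DataFrame(columns=raw_columns)

# ' Cost(K) ' has surrounding spaces, so normalise whitespace first
toy.columns = toy.columns.str.strip()

rename_map = {
    'Team member': 'Team Member',
    'Duration(day)': 'Duration',
    'Cost(K)': 'Cost',
    'height': 'Height',
    'type': 'Antenna Type',
    'orientation': 'Orientation',
    'Signal strength (dbi)': 'Signal Strength',
    'Power supply': 'Power Supply',
    'zone': 'Zone',
}

# Catch typos in the mapping before renaming
not_found = set(rename_map) - set(toy.columns)
if not_found:
    raise KeyError(f"Columns not in the DataFrame: {sorted(not_found)}")

toy = toy.rename(columns=rename_map)
print(list(toy.columns))
```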


Since we clean and preprocess the variables by type, it helps to group them by type in the column order.

In [29]:
order = [
    'Start Date',
    'Duration',
    'Cost',
    'Team Member',
    'Height',
    'Frequency',
    'Signal Strength',
    'Antenna Type',
    'Orientation',
    'Power Supply',
    'Zone',
]

df = df[order]
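One thing to watch with this pattern: selecting columns with a list drops any column that is not in the list without warning. A small sketch of a sanity check before reordering, on a toy frame with made-up contents:

```python
import pandas as pd

toy = pd.DataFrame({
    "Cost": ["$516,773"],
    "Start Date": ["3/4/2019"],
    "Zone": ["North"],
})
order = ["Start Date", "Cost", "Zone"]

# Columns that would be dropped silently, and names that do not exist
missing = set(toy.columns) - set(order)
unknown = set(order) - set(toy.columns)
if missing or unknown:
    raise ValueError(f"missing from order: {missing}, not in data: {unknown}")

toy = toy[order]
print(list(toy.columns))
```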

In [31]:
df.head(5)

Unnamed: 0,Start Date,Duration,Cost,Team Member,Height,Frequency,Signal Strength,Antenna Type,Orientation,Power Supply,Zone
0,3/4/2019,241.0,"$516,773",12,24,Very Low Frequencies (VLF),3,Dielectric,Omni-directional,Solar-powered,North
1,10/2/2019,608.0,"$954,888",22,42,Very Low Frequencies (VLF),3,Dielectric,Circular2,Active,Center
2,3/24/2017,,"$1,944,529",^24,27,Very Low Frequencies (VLF),5,Array,Circular,Solar-powered,North
3,8/10/2020,,"$542,578",21,^^25,Very Low Frequencies (VLF),0,Wiree,Horizontal,Active,West
4,6/29/2019,772.0,"$932,640",14,43,Very High Frequencies (VHF),3,,Horizontal,Active,Center


## Check Point

In [30]:
#with open('/content/drive/MyDrive/Python/Assets/df(1.1).pickle', 'wb') as file:
    #pickle.dump(df, file)
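For reference, here is a minimal sketch of how such a checkpoint can be saved and loaded again. It writes to a temporary folder rather than Drive, and the file name df_checkpoint.pickle and the toy frame are assumptions for illustration only.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd

toy = pd.DataFrame({"Cost": [516773, 954888], "Zone": ["North", "Center"]})

# Hypothetical location; in the notebook this would be a Drive path
checkpoint = Path(tempfile.mkdtemp()) / "df_checkpoint.pickle"

# Save the DataFrame
with open(checkpoint, "wb") as file:
    pickle.dump(toy, file)

# Load it back, e.g. at the start of the next notebook
with open(checkpoint, "rb") as file:
    restored = pickle.load(file)

print(restored.equals(toy))
```

Pickle files should only be loaded from sources you trust, since loading one can execute arbitrary code.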