# Hands-on Lab: Working with different file formats

Estimated time: **40 mins**


# Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1.  <a href="#Data-Engineering">Data Engineering</a>
2.  <a href="#Data-Engineering-Process">Data Engineering Process</a>
3.  <a href="#Working-with-different-file-formats">Working with different file formats</a>
4.  <a href="#Data-Analysis">Data Analysis</a>

</font>
</div>


# Data Engineering

**Data engineering** is one of the most critical and foundational skills in any data scientist’s toolkit.

In [4]:
!pip install seaborn lxml openpyxl



In [5]:
import pandas as pd
import seaborn as sns
import lxml
import openpyxl


In [6]:
df = pd.read_csv("Addresses.csv", header=None)

In [7]:
df

| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | John | Doe | 120 jefferson st. | Riverside | NJ | 8075 |
| 1 | Jack | McGinnis | 220 hobo Av. | Phila | PA | 9119 |
| 2 | John "Da Man" | Repici | 120 Jefferson St. | Riverside | NJ | 8075 |
| 3 | Stephen | Tyler | 7452 Terrace "At the Plaza" road | SomeTown | SD | 91234 |
| 4 | NaN | Blankman | NaN | SomeTown | SD | 298 |
| 5 | Joan "the bone", Anne | Jet | 9th, at Terrace plc | Desert City | CO | 123 |
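To see what `header=None` changes, here is a minimal sketch that reads the same text twice, once with `header=None` and once with the default setting. The two-row `csv_text` sample is a hypothetical stand-in for `Addresses.csv`, and the variable names are illustrative only.

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for Addresses.csv (no header row).
csv_text = (
    "John,Doe,120 jefferson st.,Riverside,NJ,8075\n"
    "Jack,McGinnis,220 hobo Av.,Phila,PA,9119\n"
)

# header=None: every line is data and the columns get integer labels.
no_header = pd.read_csv(io.StringIO(csv_text), header=None)

# Default (header inferred): the first line is consumed as column names.
default_header = pd.read_csv(io.StringIO(csv_text))

print(list(no_header.columns))
print(len(no_header), len(default_header))
```

With the default setting, the first record would silently become the column names, which is why the lab passes `header=None` for a file that has no header row.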


#### Adding column names to the DataFrame

We can assign column names to an existing DataFrame through its **columns** attribute. The list must contain exactly one name per column.


In [8]:
df.columns = ['First Name', 'Last Name', 'Location', 'City', 'State', 'Area Code']

In [9]:
df

| | First Name | Last Name | Location | City | State | Area Code |
|---|---|---|---|---|---|---|
| 0 | John | Doe | 120 jefferson st. | Riverside | NJ | 8075 |
| 1 | Jack | McGinnis | 220 hobo Av. | Phila | PA | 9119 |
| 2 | John "Da Man" | Repici | 120 Jefferson St. | Riverside | NJ | 8075 |
| 3 | Stephen | Tyler | 7452 Terrace "At the Plaza" road | SomeTown | SD | 91234 |
| 4 | NaN | Blankman | NaN | SomeTown | SD | 298 |
| 5 | Joan "the bone", Anne | Jet | 9th, at Terrace plc | Desert City | CO | 123 |
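As an alternative to assigning `df.columns` after loading, `pd.read_csv` also accepts a `names` argument that sets the column names while reading. The sketch below compares both routes on a small hypothetical sample; `csv_text`, `col_names`, `df_a` and `df_b` are illustrative names, not part of the lab.

```python
import io

import pandas as pd

# Hypothetical three-column sample without a header row.
csv_text = "John,Doe,Riverside\nJack,McGinnis,Phila\n"
col_names = ["First Name", "Last Name", "City"]

# Option 1: load with integer labels, then assign the names.
df_a = pd.read_csv(io.StringIO(csv_text), header=None)
df_a.columns = col_names

# Option 2: supply the names while reading; the first line is then kept as data.
df_b = pd.read_csv(io.StringIO(csv_text), names=col_names)

print(list(df_a.columns) == list(df_b.columns))
print(df_b)
```

Both routes should give the same table. Note that assigning a list whose length differs from the number of columns raises a `ValueError`.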


#### Selecting a single column

To select the first column 'First Name', you can pass the column name as a string to the indexing operator.


In [10]:
df["First Name"]

0                     John
1                     Jack
2            John "Da Man"
3                  Stephen
4                      NaN
5    Joan "the bone", Anne
Name: First Name, dtype: object

#### Selecting multiple columns

To select multiple columns, you can pass a list of column names to the indexing operator.


In [16]:
df[['First Name', 'Last Name', 'Location', 'City', 'State', 'Area Code']]

| | First Name | Last Name | Location | City | State | Area Code |
|---|---|---|---|---|---|---|
| 0 | John | Doe | 120 jefferson st. | Riverside | NJ | 8075 |
| 1 | Jack | McGinnis | 220 hobo Av. | Phila | PA | 9119 |
| 2 | John "Da Man" | Repici | 120 Jefferson St. | Riverside | NJ | 8075 |
| 3 | Stephen | Tyler | 7452 Terrace "At the Plaza" road | SomeTown | SD | 91234 |
| 4 | NaN | Blankman | NaN | SomeTown | SD | 298 |
| 5 | Joan "the bone", Anne | Jet | 9th, at Terrace plc | Desert City | CO | 123 |
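Passing a list usually makes most sense when you want only some of the columns. The following sketch, built on a small hypothetical `people` DataFrame, shows that a single string returns a Series while a list of names returns a DataFrame, even when the list holds just one name.

```python
import pandas as pd

# Hypothetical sample data; the column names mirror the ones used in this lab.
people = pd.DataFrame(
    {
        "First Name": ["John", "Jack"],
        "Last Name": ["Doe", "McGinnis"],
        "City": ["Riverside", "Phila"],
    }
)

single = people["First Name"]             # one string -> Series
subset = people[["First Name", "City"]]   # list of names -> DataFrame
one_col_frame = people[["First Name"]]    # one-item list -> still a DataFrame

print(type(single).__name__, type(one_col_frame).__name__)
print(subset)
```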


#### Selecting rows using .iloc and .loc

Now, let's see how to use `.loc` for selecting rows from our DataFrame.

**loc : `.loc` is a label-based indexer, which means that we pass the labels (names) of the rows or columns we want to select, inside square brackets.**


In [17]:
# Select the row whose index label is 0 (the first row here)
df.loc[0]

First Name                 John
Last Name                   Doe
Location      120 jefferson st.
City                  Riverside
State                        NJ
Area Code                  8075
Name: 0, dtype: object

In [23]:
# Select the rows labeled 0, 1 and 2 from the "First Name" column only
df.loc[[0,1,2], "First Name"]

0             John
1             Jack
2    John "Da Man"
Name: First Name, dtype: object

Now, let's see how to use `.iloc` for selecting rows from our DataFrame.

**iloc : `.iloc` is an integer-position-based indexer, which means that we pass integer positions, inside square brackets, to select specific rows or columns.**


In [24]:
# Select the rows at positions 0, 1 and 2 from the first column (position 0)
df.iloc[[0,1,2], 0]

0             John
1             Jack
2    John "Da Man"
Name: First Name, dtype: object
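In the DataFrame above, the row labels happen to equal the row positions, so `.loc` and `.iloc` return the same rows. The difference shows up when the index is not 0, 1, 2, and so on. Here is a minimal sketch using a hypothetical `people` DataFrame with labels 10, 20 and 30:

```python
import pandas as pd

# Hypothetical data with a non-default index, so labels differ from positions.
people = pd.DataFrame(
    {"First Name": ["John", "Jack", "Stephen"], "State": ["NJ", "PA", "SD"]},
    index=[10, 20, 30],
)

by_label = people.loc[10, "First Name"]   # row labeled 10
by_position = people.iloc[0, 0]           # row at position 0, column at position 0

# Label slices include the end label; position slices exclude the end position.
label_slice = people.loc[10:20]
pos_slice = people.iloc[0:1]

print(by_label, by_position)
print(len(label_slice), len(pos_slice))
```

Here `people.loc[0]` would raise a `KeyError`, because no row carries the label 0.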

### Transform Function in Pandas

The pandas `transform` method returns a new DataFrame with the same shape as the original, containing the values produced by applying the function passed as its argument.

Let's see how the `transform` method works.


In [25]:
import pandas as pd
import numpy as np

In [26]:
# Create a sample DataFrame to transform
df = pd.DataFrame(np.array([[1,2,3], [4,5,6], [7,8,9]]), columns = ['a', 'b', 'c'])
df

| | a | b | c |
|---|---|---|---|
| 0 | 1 | 2 | 3 |
| 1 | 4 | 5 | 6 |
| 2 | 7 | 8 | 9 |


Let’s say we want to add 10 to each element in a dataframe:
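One way this could look is sketched below. It rebuilds the sample data under the illustrative names `numbers` and `plus_ten` so that it runs on its own, and passes a lambda to `DataFrame.transform`; treat it as one possible approach, not the only one.

```python
import numpy as np
import pandas as pd

# Same sample values as above, rebuilt so this sketch is self-contained.
numbers = pd.DataFrame(
    np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), columns=["a", "b", "c"]
)

# transform applies the function and returns a new DataFrame of the same shape;
# the original DataFrame is left unchanged.
plus_ten = numbers.transform(lambda x: x + 10)

print(plus_ten)
```

For simple arithmetic like this, `numbers + 10` gives the same values; `transform` becomes more useful when the function is more involved.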

In [None]:
# applying the transform function
df = pd