# Week 6 - Data Wrangling

![data_wrangling.png](attachment:data_wrangling.png)

##  Table of Contents

- Theoretical Overview
- Problem Statement
- Code
    - Importing packages & libraries
    - Duplicated function
    - Map function
    - Replace function
    - Rename function
    - Describe function
    - GetDummies function
    - Quiz

## Theoretical Overview
- Most real-world data is dirty, so we must clean and convert datasets before we can analyze them.
- Data wrangling refers to several procedures intended to convert unstructured data into formats that are easier to work with.
- Data wrangling transforms data from an unorganised or untidy source into something valuable.
- Data Wrangling consists of 6 steps:
    1. Discovery
    2. Structuring
    3. Cleaning
    4. Enriching
    5. Validating
    6. Publishing
- Data transformation is the technological process of translating data from one format, standard, or structure to another without affecting the content of the datasets.
- Data transformation may include:
    1. Constructive (adding, copying)
    2. Destructive (deleting fields and records)
    3. Structural (renaming, moving, and combining of columns)
- In this activity, we will briefly touch on the following functions: duplicated(), drop_duplicates(), map(), replace(), rename(), describe(), get_dummies()
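
The three kinds of data transformation listed above can be sketched with pandas as follows. This is a minimal illustration only: the DataFrame contents and the names `constructive`, `destructive`, and `structural` are hypothetical and do not come from the slides.

```python
import pandas as pd

# Hypothetical example data, not part of the original activity.
df = pd.DataFrame({"first": ["Ann", "Bob", "Bob"],
                   "last": ["Lee", "Ray", "Ray"],
                   "score": [10, 20, 20]})

# Constructive: add a new column derived from an existing one.
constructive = df.assign(score_doubled=df["score"] * 2)

# Destructive: delete a field (column) and the repeated records (rows).
destructive = df.drop(columns=["last"]).drop_duplicates()

# Structural: rename a column and combine two columns into one.
structural = df.rename(columns={"score": "points"})
structural["full_name"] = structural["first"] + " " + structural["last"]

print(constructive)
print(destructive)
print(structural)
```

None of these calls modifies `df` itself; each one returns a new DataFrame.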

## Problem Statement

This notebook is based on the theory and tutorial covered in slides 63 and 75 of Data Wrangling. In this activity, you will create several DataFrames and apply data cleaning techniques such as **drop_duplicates, map, replace, rename, and get_dummies**.

## Code

 ### Importing of required libraries:

In [1]:
import pandas as pd
import numpy as np

 ### Duplicated function:

Creating a dataset:

In [2]:
df_d = pd.DataFrame({"a":["one","two"]*3,
                    "b": [1,1,2,3,2,3]})

df_d

Unnamed: 0,a,b
0,one,1
1,two,1
2,one,2
3,two,3
4,one,2
5,two,3


The function "duplicated()" returns a boolean Series that is True for every row that repeats an earlier row:

In [3]:
df_d.duplicated()

0    False
1    False
2    False
3    False
4     True
5     True
dtype: bool

The function "drop_duplicates()" returns a copy of the DataFrame with the duplicate rows removed:

In [4]:
df_d.drop_duplicates()

Unnamed: 0,a,b
0,one,1
1,two,1
2,one,2
3,two,3
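
By default, both functions keep the first occurrence and compare all columns. As a hedged sketch, the `keep` and `subset` parameters (which the activity above does not use) change which rows count as duplicates:

```python
import pandas as pd

df_d = pd.DataFrame({"a": ["one", "two"] * 3,
                     "b": [1, 1, 2, 3, 2, 3]})

# keep="last" treats the last occurrence as the original,
# so the earlier copies are flagged instead.
flag_last = df_d.duplicated(keep="last")
print(flag_last)

# keep=False flags every row that has a duplicate anywhere.
flag_all = df_d.duplicated(keep=False)
print(flag_all)

# subset limits the comparison to the listed columns only.
unique_a = df_d.drop_duplicates(subset=["a"])
print(unique_a)
```

With `subset=["a"]`, only the first "one" row and the first "two" row remain, because column "b" is ignored in the comparison.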


 ### Map function:

The following code creates a DataFrame:

In [5]:
df_m = pd.DataFrame({"names":["Olivia","Amelia","Isabelle","Mia","Ella"],
                    "scores":[50,32,67,32,21]})

df_m

Unnamed: 0,names,scores
0,Olivia,50
1,Amelia,32
2,Isabelle,67
3,Mia,32
4,Ella,21


We can transform the values in a DataFrame column with the function "map()".

 So for example, the name "Olivia" will be mapped to "O" and the name "Amelia" will be mapped to "A".

The code below creates a dictionary that pairs each name with the value it will be mapped to:

In [6]:
classes = {"Olivia":"O","Amelia":"A","Isabelle":"I","Mia":"M","Ella":"E"}

 The following code will do the mapping:

In [7]:
df_m["Groupings"] = df_m["names"].map(classes)

df_m

Unnamed: 0,names,scores,Groupings
0,Olivia,50,O
1,Amelia,32,A
2,Isabelle,67,I
3,Mia,32,M
4,Ella,21,E
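
Two details of "map()" are worth a quick sketch: values that are missing from the dictionary become NaN, and a function can be passed instead of a dictionary. The name "Zoe" and the variable names below are hypothetical additions for illustration.

```python
import pandas as pd

# "Zoe" is a hypothetical extra name with no entry in the dictionary.
names = pd.Series(["Olivia", "Amelia", "Zoe"])
classes = {"Olivia": "O", "Amelia": "A"}

# Values without a matching key are mapped to NaN.
mapped = names.map(classes)
print(mapped)

# A function works too; here it takes the first letter of each name.
initials = names.map(lambda name: name[0])
print(initials)
```

Checking the result for NaN after a dictionary-based mapping is a simple way to spot values the dictionary does not cover.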


 ### Replace function:

The following code is the creation of a Series object:

In [8]:
df_r = pd.Series([67,21,79,39])

df_r

0    67
1    21
2    79
3    39
dtype: int64

We can replace values in a pandas Series or DataFrame using the function "replace()".

In its simplest form, the function "replace()" takes 2 arguments.

The first is the value you want to replace, and the second is the value you would like to replace it with.

The following is a breakdown of the function:

replace(value_to_be_replaced , value_to_replace_to)



 The following code is an example that will replace the value "67" with the value "0"

In [33]:
df_r.replace(67,0)

0     0
1    21
2    79
3    39
dtype: int64

 The following code will replace multiple values with the replace() function:

In [10]:
df_r.replace([21,79],[37,38])

0    67
1    37
2    38
3    39
dtype: int64
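
As an alternative sketch, "replace()" also accepts a dictionary that pairs each old value with its new value. The variable name `replaced` is an assumption for this example. Note that "replace()" returns a new Series and leaves the original untouched unless the result is assigned back.

```python
import pandas as pd

df_r = pd.Series([67, 21, 79, 39])

# A dictionary pairs each old value with its replacement.
replaced = df_r.replace({21: 37, 79: 38})
print(replaced)

# The original Series is unchanged because replace() returns a copy.
print(df_r)
```

To keep the change, assign the result back, for example `df_r = df_r.replace({21: 37, 79: 38})`.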

 ### Rename function:

The following code is a creation of a DataFrame:

In [11]:
df_re = pd.DataFrame(np.arange(12).reshape(3,4), index=[0,1,2], columns=['sam', 'jeslyn', 'kish', 'dan'])

df_re

Unnamed: 0,sam,jeslyn,kish,dan
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11


We can rename the row and column labels of a DataFrame with the rename() function.

The following code changes the column names from lowercase to uppercase by passing the function str.upper:

In [12]:
df_re.rename(columns = str.upper)

Unnamed: 0,SAM,JESLYN,KISH,DAN
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11


We can also change the row or column names using the "rename()" function.

Now we will rename the index from "0" to "zero"

Code to rename the index label:

In [13]:
df_re.rename(index={0:"zero"})

Unnamed: 0,sam,jeslyn,kish,dan
zero,0,1,2,3
1,4,5,6,7
2,8,9,10,11


Now we will rename the column from "sam" to "spade"

Code to rename the column:

In [14]:
df_re.rename(columns={"sam":"spade"})

Unnamed: 0,spade,jeslyn,kish,dan
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
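
Each rename() call above returns a new DataFrame, which is why "sam" and index "0" reappear in later outputs. A minimal sketch, assuming we want both renames at once and want to keep the result (the variable name `renamed` is hypothetical):

```python
import numpy as np
import pandas as pd

df_re = pd.DataFrame(np.arange(12).reshape(3, 4), index=[0, 1, 2],
                     columns=["sam", "jeslyn", "kish", "dan"])

# index and columns can be renamed in a single call.
renamed = df_re.rename(index={0: "zero"}, columns={"sam": "spade"})
print(renamed)

# The original DataFrame still has its old labels.
print(df_re.columns.tolist())
print(df_re.index.tolist())
```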


 ### Describe function:

The following code generates a DataFrame:

In [15]:
df_desc = pd.DataFrame(np.random.randn(2000,5))

df_desc

Unnamed: 0,0,1,2,3,4
0,-1.791395,-1.698663,0.707392,-0.543470,0.730230
1,2.271865,-2.595773,2.234847,0.510553,-0.677354
2,1.654614,-0.311770,0.780638,-1.216943,-1.207879
3,-0.972146,2.328212,-1.565868,-0.619114,-2.902758
4,-0.179823,0.184664,0.530264,1.087014,1.281995
...,...,...,...,...,...
1995,-0.134127,1.294891,0.652396,0.356346,-0.138065
1996,-0.063177,-0.549013,0.285147,-0.441043,-0.958336
1997,0.349048,0.868328,-0.090979,1.560201,2.580582
1998,-0.914124,1.623574,0.512228,1.649902,-0.788340


With the help of the describe() function, we can get summary statistics for each numeric column of the DataFrame.

In [16]:
df_desc.describe()

Unnamed: 0,0,1,2,3,4
count,2000.0,2000.0,2000.0,2000.0,2000.0
mean,0.014645,0.006536,-0.006379,-0.005338,-0.029978
std,1.003527,1.019575,1.004883,0.979012,1.005141
min,-3.452063,-3.301635,-3.359169,-3.408335,-3.210236
25%,-0.673028,-0.678604,-0.708562,-0.660512,-0.706842
50%,0.033176,-0.006661,-0.023815,-0.005099,-0.07389
75%,0.664463,0.663652,0.670048,0.650592,0.654371
max,3.87829,4.297517,4.225517,3.456365,4.099078


Here we can see a breakdown of the following statistics, each computed separately for every column:

- count: the number of non-missing values in the column
- mean: the average of the values in the column
- std: the standard deviation, a measure of the variation or dispersion of the values
- min: the minimum value in the column
- 25%: the 25th percentile, known as the first or lower quartile
- 50%: the 50th percentile, known as the median, which cuts the data in half
- 75%: the 75th percentile, known as the third or upper quartile
- max: the maximum value in the column
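
To see that describe() works column by column, the sketch below compares a few entries of the summary with direct calls on a single column. The seeded generator is an assumption added here so that the random data is reproducible; the original cell uses np.random.randn without a seed, so its numbers differ on every run.

```python
import numpy as np
import pandas as pd

# A seeded generator (not used in the original cell) makes the data reproducible.
rng = np.random.default_rng(0)
df_desc = pd.DataFrame(rng.standard_normal((2000, 5)))

summary = df_desc.describe()

# Each row of the summary is one statistic, each column one DataFrame column.
# Compare column 0 of the summary with direct calculations on that column.
print(summary.loc["count", 0], df_desc[0].count())
print(summary.loc["mean", 0], df_desc[0].mean())
print(summary.loc["50%", 0], df_desc[0].median())
print(summary.loc["std", 0], df_desc[0].std())
```

Individual statistics can be read from the summary with `.loc`, using the row label (such as "mean") and the column label.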

### get_dummies()

The following code generates a DataFrame:

In [17]:
df_d = pd.DataFrame({"Letter":["a","b"]*3,
                    "Number": [0,1,2,3,4,5]})

df_d

Unnamed: 0,Letter,Number
0,a,0
1,b,1
2,a,2
3,b,3
4,a,4
5,b,5


get_dummies() function converts a categorical variable into dummy/indicator variables.

You can read more about this function from the pandas library documentation page:
https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html

The code below applies "get_dummies()" to the "Letter" column, producing one indicator column per distinct letter:

In [18]:
pd.get_dummies(df_d["Letter"])

Unnamed: 0,a,b
0,1,0
1,0,1
2,1,0
3,0,1
4,1,0
5,0,1


Column "a" shows 1 in every row where the letter is "a" and 0 otherwise.

Column "b" shows 1 in every row where the letter is "b" and 0 otherwise.
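
Recent pandas versions return True/False instead of 1/0 from get_dummies() by default. The sketch below passes dtype=int to request 0/1 integers explicitly and then attaches the indicator columns to the original DataFrame. The variable names `dummies` and `combined` are assumptions for this example.

```python
import pandas as pd

df_d = pd.DataFrame({"Letter": ["a", "b"] * 3,
                     "Number": [0, 1, 2, 3, 4, 5]})

# dtype=int asks for 0/1 integers instead of booleans.
dummies = pd.get_dummies(df_d["Letter"], dtype=int)

# join() lines the indicator columns up with the original rows by index.
combined = df_d.join(dummies)
print(combined)
```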



 ## Quiz time!

 #### Question 1:

The code that creates the DataFrame is given below. Run it, then proceed with the questions that follow.

In [19]:
q1 = pd.DataFrame({"a":["Four","Five"]*3,
                    "b": [4,5,4,5,4,5]})

q1

Unnamed: 0,a,b
0,Four,4
1,Five,5
2,Four,4
3,Five,5
4,Four,4
5,Five,5


 What is the code used to Check for duplicates in the DataFrame stored in the variable "q1"? (Created above)

 Please type the code below to CHECK for duplicates:

In [20]:
#Code
q1.duplicated()


0    False
1    False
2     True
3     True
4     True
5     True
dtype: bool

 What is the code used to Drop duplicates in the DataFrame stored in the variable "q1"? (Created above)

 Please type the code below to DROP duplicates:

In [22]:
#Code
q1.drop_duplicates()

Unnamed: 0,a,b
0,Four,4
1,Five,5


 #### Question 2:

The code that creates the DataFrame is given below. Run it, then proceed with the questions that follow.

In [23]:
q2 = pd.DataFrame({"names":["Kelly","Oliver","Kenneth","Bill","Darren"],
                    "scores":[50,32,67,32,21]})

q2

Unnamed: 0,names,scores
0,Kelly,50
1,Oliver,32
2,Kenneth,67
3,Bill,32
4,Darren,21


 What is the code used to MAP values in a DataFrame?

 Please type the code below to MAP values for the following:

 - Kelly will be mapped as "K"
 - Oliver will be mapped as "O"
 - Kenneth will be mapped as "K"
 - Bill will be mapped as "B"
 - Darren will be mapped as "D"



In [24]:
# Code
classes = {"Kelly":"K","Oliver":"O","Kenneth":"K","Bill":"B","Darren":"D"}


Type the code to do the mapping below:

In [26]:
# Code
q2["Groupings"] = q2["names"].map(classes)
print(q2)

     names  scores Groupings
0    Kelly      50         K
1   Oliver      32         O
2  Kenneth      67         K
3     Bill      32         B
4   Darren      21         D


 #### Question 3:

The code that creates the Series object is given below. Run it, then proceed with the questions that follow.

In [36]:
q3 = pd.Series([93,23,37,99])

q3

0    93
1    23
2    37
3    99
dtype: int64

 What is the code used to REPLACE values in a Series?

 Please type the code below to REPLACE values for the following:
 - Value "93" is to be replaced with "21"
 - Value "23" is to be replaced with "22"
 - Value "37" is to be replaced with "23"
 - Value "99" is to be replaced with "24"

Type the code below to replace Value "93" with "21":


In [41]:
# Code
q3.replace(93,21)


0    21
1    23
2    37
3    99
dtype: int64

Type the code below to replace Value "23" with "22":


In [42]:
# Code
q3.replace(23,22)


0    93
1    22
2    37
3    99
dtype: int64

Type the code below to replace Value "37" with "23":


In [44]:
# Code
q3.replace(37,23)


0    93
1    23
2    23
3    99
dtype: int64

Type the code below to replace Value "99" with "24":


In [45]:
# Code
q3.replace(99,24)

0    93
1    23
2    37
3    24
dtype: int64
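
The four answers above each start from the original q3, so no single output contains all four changes. As a hedged sketch, one dictionary passed to "replace()" applies all four replacements in a single call, with each key matched against the original values. The variable names `all_at_once` and `chained` are hypothetical.

```python
import pandas as pd

q3 = pd.Series([93, 23, 37, 99])

# One call, four replacements; keys are matched against the original values,
# so the new 23 (from 37) is not turned into 22.
all_at_once = q3.replace({93: 21, 23: 22, 37: 23, 99: 24})
print(all_at_once)

# Chaining separate calls is order-sensitive: here the 23 produced by the
# first call is replaced again by the second call.
chained = q3.replace(37, 23).replace(23, 22)
print(chained)
```

When new values overlap with old values, as 23 does here, the dictionary form avoids this kind of accidental double replacement.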

 #### Question 4:

The code that creates the DataFrame is given below. Run it, then proceed with the questions that follow.

In [46]:
q4 = pd.DataFrame(np.arange(12).reshape(3,4), index=[0,1,2], columns=['calvin', 'jorddie', 'dom', 'allen'])

q4

Unnamed: 0,calvin,jorddie,dom,allen
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11


What is the function that is used to change the column names from lowercase to uppercase?

Key in the code to change the column names from lowercase to UPPERCASE:

In [47]:
# Code
q4.rename(columns = str.upper)

Unnamed: 0,CALVIN,JORDDIE,DOM,ALLEN
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11


We can also change the row or column names using the "rename()" function.

The function is to be used to rename the following indexes:
- Index "2" to be changed to "two"

Key in the code to rename the index label:

In [49]:
# Code
q4.rename(index={2:"two"})

Unnamed: 0,calvin,jorddie,dom,allen
0,0,1,2,3
1,4,5,6,7
two,8,9,10,11


Now rename the column "calvin" to "bavier".

Key in the code to rename the column:

In [52]:
# Code
q4.rename(columns={"calvin":"bavier"})

Unnamed: 0,bavier,jorddie,dom,allen
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11


 #### Question 5:

The code that creates the DataFrame is given below. Run it, then proceed with the questions that follow.

In [53]:
q5 = pd.DataFrame(np.random.randn(2000,5))

q5

Unnamed: 0,0,1,2,3,4
0,-0.197909,-0.383082,-0.645310,0.107737,0.626859
1,-1.593988,0.214118,1.480095,-0.438086,-0.424645
2,-1.023350,-0.355082,1.771793,-0.402790,-0.456771
3,0.914228,-0.615503,1.079211,0.751370,-0.635752
4,-0.665505,-0.954159,-0.215477,-2.257835,0.724382
...,...,...,...,...,...
1995,1.439548,-0.473853,-0.592334,0.144436,0.079859
1996,-1.094881,-0.944198,-0.470673,-0.397718,0.255232
1997,-0.578697,0.492812,-1.304208,0.269738,0.238063
1998,1.692343,-1.824728,-1.150815,-1.710151,-0.634532


What is the function used to get summary statistics of a DataFrame?

Key in the code to get the summary statistics of the DataFrame:

In [55]:
# Code
q5.describe()

Unnamed: 0,0,1,2,3,4
count,2000.0,2000.0,2000.0,2000.0,2000.0
mean,0.000398,0.008359,-0.02603,0.002115,-0.005978
std,0.970228,1.003228,1.008615,1.002378,0.975892
min,-2.63127,-3.151955,-3.927191,-3.616509,-3.181542
25%,-0.660145,-0.679143,-0.655024,-0.669165,-0.663633
50%,-0.017467,-0.010655,-0.043066,0.003923,-0.043481
75%,0.642609,0.708947,0.610152,0.677593,0.600563
max,3.039095,3.169863,3.937713,3.957937,3.783385


##### Name 2 of the summary statistics and explain what they mean:

Statistic summary 1:

In [59]:
# Key in answer here
'Mean value is the average of the related values, value sum/value count'

'Mean value is the average of the related values, value sum/value count'

Statistic summary 2:

In [58]:
# Key in answer here
'Min/Max values are the minimum and maximum values respectively...the smallest and largest values of the collection'

'Min/Max values are the minimum and maximum values respectively...the smallest and largest values of the collection'

 #### Question 6:

The code that creates the DataFrame is given below. Run it, then proceed with the questions that follow.

In [56]:
q6 = pd.DataFrame({"Letter":["c","d"]*3,
                    "Number": [0,1,2,3,4,5]})

q6

Unnamed: 0,Letter,Number
0,c,0
1,d,1
2,c,2
3,d,3
4,c,4
5,d,5


 What is the function used to create dummy variables?

 Create dummy variables from the column "Letter".

Key in the code used to create dummy variables from a DataFrame column:

In [57]:
# Code
pd.get_dummies(q6["Letter"])

Unnamed: 0,c,d
0,1,0
1,0,1
2,1,0
3,0,1
4,1,0
5,0,1


 # Congratulations on completing this activity!