# Data wrangling

Data wrangling is a broad term used, often informally, to describe the process of
transforming raw data into a clean and organized format ready for use. For us, data
wrangling is only one step in preprocessing our data, but it is an important step.

The most common data structure used to “wrangle” data is the data frame, which can
be both intuitive and incredibly versatile. Data frames are tabular, meaning that they
are based on rows and columns like you would see in a spreadsheet.

**[Learn more about writing LaTeX in Markdown](https://csrgxtu.github.io/2015/03/20/Writing-Mathematic-Fomulars-in-Markdown/)**

In [27]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import scipy

# Load CSV File

In [28]:
# Read CSV files by making one column as index
# data_frame = pd.read_csv('dataset/circle_employee.csv', index_col='name') 

df = pd.read_csv('dataset/circle_employee.csv')
df.head()

Unnamed: 0,id,name,age,blood_group,gender,experience,designation,salary
0,1,Sharif,,B+,male,1.5,Jr Software Engineer,30000
1,2,Kanan Mahmud,28.0,,Male,7.5,Sr Software Engineer,80000
2,3,Md. Shakil,27.0,B-,Male,3.5,Software Engineer,45000
3,4,Imran Sheikh,25.0,B-,Male,1.8,Jr Software Engineer,30000
4,5,Farsan Rashid,27.0,O+,Male,4.2,Software Engineer,55000
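
The commented-out `index_col` option above can be tried without the CSV file on disk. The following is a minimal sketch that assumes a small in-memory CSV (the `csv_text` contents and the `employees` name are made up for illustration); it shows one column becoming the row index and empty fields being read as `NaN`.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a file on disk
csv_text = """id,name,age,salary
1,Sharif,,30000
2,Kanan Mahmud,28,80000
3,Md. Shakil,27,45000
"""

# index_col makes the 'name' column the row index instead of the default 0..n-1
employees = pd.read_csv(io.StringIO(csv_text), index_col='name')

# The empty age field is parsed as NaN, so 'age' becomes a float column
print(employees.dtypes)

# With a label index, rows can be looked up by name
print(employees.loc['Kanan Mahmud', 'salary'])
```

Whether to use a column as the index depends on the data: it only helps with lookups if the values are unique for each row.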


# Create Dataframe

### Using List

In [29]:
dataframe = pd.DataFrame()

dataframe['Name'] = ['Sharif','Imran','Hanif','Akib','Fatin']
dataframe['Age'] = [26,24,np.nan,23,25]
dataframe['Blood Group'] = ['B+','O+','AB+','A+','O-']
dataframe['Sex Code'] = [1,1,1,1,1,]

dataframe

Unnamed: 0,Name,Age,Blood Group,Sex Code
0,Sharif,26.0,B+,1
1,Imran,24.0,O+,1
2,Hanif,,AB+,1
3,Akib,23.0,A+,1
4,Fatin,25.0,O-,1


### Using Dictionary

In [30]:
dct = {
    "name":['Sharif','Imran','Hanif','Akib','Fatin'],
    "Age":[26,24,np.nan,23,25],
    "Blood Group": ['B+','O+','AB+','A+','O-'],
    "Sex Code": [1,1,1,1,1,],
}
df = pd.DataFrame(dct)
df.head()

Unnamed: 0,name,Age,Blood Group,Sex Code
0,Sharif,26.0,B+,1
1,Imran,24.0,O+,1
2,Hanif,,AB+,1
3,Akib,23.0,A+,1
4,Fatin,25.0,O-,1


# Append New Rows to the Bottom Using a Series

In [31]:
new_person = pd.Series(['Kanan Mahmud',30,'B+',1],
                       index=['Name','Age','Blood Group','Sex Code'])

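# DataFrame.append was removed in pandas 2.0; pd.concat returns the same new DataFrame
pd.concat([dataframe, pd.DataFrame([new_person.to_dict()])], ignore_index=True)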

Unnamed: 0,Name,Age,Blood Group,Sex Code
0,Sharif,26.0,B+,1
1,Imran,24.0,O+,1
2,Hanif,,AB+,1
3,Akib,23.0,A+,1
4,Fatin,25.0,O-,1
5,Kanan Mahmud,30.0,B+,1


# Describe Dataset

In [32]:
# Show dimensions
dataframe.shape

(5, 4)

In [33]:
# Show statistics
dataframe.describe()

Unnamed: 0,Age,Sex Code
count,4.0,5.0
mean,24.5,1.0
std,1.290994,0.0
min,23.0,1.0
25%,23.75,1.0
50%,24.5,1.0
75%,25.25,1.0
max,26.0,1.0


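## Calculate Standard Deviation

$SD=\sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i-\Phi)^2}$

- $x_i$ is the $i$-th value
- $\Phi$ is the mean
- $N$ is the total number of values
- $SD$ is the standard deviation

This is the population form. The `std` row reported by `describe()` divides by $N-1$ instead of $N$ (the sample form), which is why `Age` shows 1.290994 rather than 1.118034.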
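
As a check on the formula, the sketch below computes the standard deviation of the `Age` values by hand and compares it with pandas. It assumes the same five ages used earlier; the variable names are illustrative only. pandas skips `NaN` and uses `ddof=1` by default, so the divide-by-$N$ value is only reproduced when `ddof=0` is passed.

```python
import numpy as np
import pandas as pd

ages = pd.Series([26, 24, np.nan, 23, 25])

# pandas ignores NaN when aggregating, so drop it for the manual calculation
values = ages.dropna().to_numpy()
n = len(values)
mean = values.sum() / n

squared_deviations = (values - mean) ** 2

# Population form: divide by N
population_sd = np.sqrt(squared_deviations.sum() / n)

# Sample form: divide by N - 1 (the pandas default, ddof=1)
sample_sd = np.sqrt(squared_deviations.sum() / (n - 1))

print(population_sd, ages.std(ddof=0))
print(sample_sd, ages.std())
```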

# Select one or more rows or values

In [34]:
# Select first row
print(dataframe.iloc[0])

# Select the second, third, and fourth rows
dataframe.iloc[1:4]

# Select rows up to and including the fourth
dataframe.iloc[:4]

Name           Sharif
Age                26
Blood Group        B+
Sex Code            1
Name: 0, dtype: object


Unnamed: 0,Name,Age,Blood Group,Sex Code
0,Sharif,26.0,B+,1
1,Imran,24.0,O+,1
2,Hanif,,AB+,1
3,Akib,23.0,A+,1


# Set Index For Data Frame
All rows in a pandas DataFrame have a unique index value. By default, this index is
an integer indicating the row position in the DataFrame; however, it does not have to
be. DataFrame indexes can be set to be unique alphanumeric strings or customer
numbers. To select individual rows and slices of rows, pandas provides two methods:
* **loc** is useful when the index of the DataFrame is a label (e.g., a string).
* **iloc** works by looking for the position in the DataFrame. For example, iloc[0] will return the first row regardless of whether the index is an integer or a label.

It is useful to be comfortable with both loc and iloc since they will come up a lot
during data cleaning.

Although DataFrames provide a built-in numerical index, we can set the index of a DataFrame to any column whose values are unique to each row.
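
The difference between `loc` and `iloc` is easiest to see on a DataFrame with a label index. This is a small sketch using a hypothetical `people` DataFrame (not the one built above). One detail to watch is that label slices with `loc` include the end label, while position slices with `iloc` exclude the end position.

```python
import pandas as pd

# Hypothetical DataFrame indexed by name rather than by position
people = pd.DataFrame(
    {'Age': [26, 24, 23], 'Blood Group': ['B+', 'O+', 'A+']},
    index=['Sharif', 'Imran', 'Akib'],
)

# Same row, reached by position and by label
first_by_position = people.iloc[0]
sharif_by_label = people.loc['Sharif']

# Position slice: the end position (2) is excluded
by_position = people.iloc[0:2]

# Label slice: the end label ('Imran') is included
by_label = people.loc['Sharif':'Imran']

print(by_position)
print(by_label)
```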

In [35]:
# Set index (set_index returns a new DataFrame; the original is unchanged)
df = dataframe.set_index(dataframe['Name'])
print(dataframe)

# Show row
df.loc['Sharif']

     Name   Age Blood Group  Sex Code
0  Sharif  26.0          B+         1
1   Imran  24.0          O+         1
2   Hanif   NaN         AB+         1
3    Akib  23.0          A+         1
4   Fatin  25.0          O-         1


Name           Sharif
Age                26
Blood Group        B+
Sex Code            1
Name: Sharif, dtype: object



# Selecting Rows Based on Conditionals
Conditionally selecting and filtering data is one of the most common tasks in data wrangling. You rarely want all the raw data from the source; instead, you are interested in only some subset of it. For example, you might only be interested in stores in certain states or the records of patients over a certain age.
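
The patient example can be sketched with boolean masks. The `patients` DataFrame below is invented for illustration. Each comparison produces a boolean Series, and masks are combined with `&` (and), `|` (or), and `~` (not), with each condition wrapped in parentheses.

```python
import numpy as np
import pandas as pd

# Hypothetical patient records
patients = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'age': [34, 71, np.nan, 66],
    'state': ['TX', 'CA', 'TX', 'NY'],
})

# Single condition: a comparison against NaN is False, so C is excluded
over_65 = patients[patients['age'] > 65]

# Membership test against several allowed values
tx_or_ny = patients[patients['state'].isin(['TX', 'NY'])]

# Both conditions must hold
over_65_in_ny = patients[(patients['age'] > 65) & (patients['state'] == 'NY')]

# Either condition may hold
under_40_or_ny = patients[(patients['age'] < 40) | (patients['state'] == 'NY')]

# Negate a mask
not_tx = patients[~(patients['state'] == 'TX')]

print(over_65)
print(under_40_or_ny)
```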

In [54]:
# Show rows where column 'Blood Group' is 'B+'
print(dataframe[dataframe['Blood Group'] == 'B+'])

# Multiple conditions: combine boolean Series with | (or) and & (and).
# Python's `or`/`and` raise "ValueError: The truth value of a Series is ambiguous".
dataframe[(dataframe['Name'] == 'Sharif') | (dataframe['Age'] >= 20)]

     Name   Age Blood Group  Sex Code
0  Sharif  26.0          B+         1


Unnamed: 0,Name,Age,Blood Group,Sex Code
0,Sharif,26.0,B+,1
1,Imran,24.0,O+,1
3,Akib,23.0,A+,1
4,Fatin,25.0,O-,1