# 06 Data Wrangling

<p>Data wrangling may include further <a href="/wiki/Mung_(computer_term)" title="Mung (computer term)">munging</a>, <a href="/wiki/Data_visualization" title="Data visualization">data visualization</a>, data aggregation, training a <a href="/wiki/Statistical_model" title="Statistical model">statistical model</a>, and many other potential uses. As a process, data munging typically follows a set of general steps: extracting the data in raw form from the data source, "munging" the raw data using algorithms (e.g. sorting) or parsing it into predefined data structures, and finally depositing the resulting content into a data sink for storage and future use.<sup id="cite_ref-eduunix_1-0" class="reference"><a href="#cite_note-eduunix-1"></a></sup>
</p>
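
The extract, munge, deposit sequence described above can be sketched in a few lines of pandas. This is a minimal illustration, not the pipeline used in the rest of this notebook: the inline CSV text, the column names, and the in-memory sink are all made up for the example.

```python
import io

import pandas as pd

# Hypothetical raw data; an in-memory string stands in for a real data source.
raw_source = io.StringIO("name,minutes\nbob,12.5\nalice,30.0\ncarol,7.25\n")

# 1. Extract: read the raw data into a predefined structure (a DataFrame).
raw = pd.read_csv(raw_source)

# 2. Munge: apply a simple algorithm, here sorting by the numeric column.
munged = raw.sort_values("minutes", ascending=False).reset_index(drop=True)

# 3. Deposit: write the result to a data sink (an in-memory buffer here;
#    a file path or a database would take its place in practice).
sink = io.StringIO()
munged.to_csv(sink, index=False)
```

In a real workflow the sink would usually be a file or a database table, so that later steps can pick up the cleaned data without repeating the munging.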

In [1]:
import pandas as pd

In [2]:
import os
from pathlib import Path

# Path to the dataset, relative to the current working directory
customer_churn_dataset = Path.cwd() / 'data' / 'customer-churn-model' / 'Customer Churn Model.txt'

In [3]:
data = pd.read_csv(customer_churn_dataset)

In [4]:
data.head()

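| | State | Account Length | Area Code | Phone | Int'l Plan | VMail Plan | VMail Message | Day Mins | Day Calls | Day Charge | ... | Eve Calls | Eve Charge | Night Mins | Night Calls | Night Charge | Intl Mins | Intl Calls | Intl Charge | CustServ Calls | Churn? |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | KS | 128 | 415 | 382-4657 | no | yes | 25 | 265.1 | 110 | 45.07 | ... | 99 | 16.78 | 244.7 | 91 | 11.01 | 10.0 | 3 | 2.7 | 1 | False. |
| 1 | OH | 107 | 415 | 371-7191 | no | yes | 26 | 161.6 | 123 | 27.47 | ... | 103 | 16.62 | 254.4 | 103 | 11.45 | 13.7 | 3 | 3.7 | 1 | False. |
| 2 | NJ | 137 | 415 | 358-1921 | no | no | 0 | 243.4 | 114 | 41.38 | ... | 110 | 10.3 | 162.6 | 104 | 7.32 | 12.2 | 5 | 3.29 | 0 | False. |
| 3 | OH | 84 | 408 | 375-9999 | yes | no | 0 | 299.4 | 71 | 50.9 | ... | 88 | 5.26 | 196.9 | 89 | 8.86 | 6.6 | 7 | 1.78 | 2 | False. |
| 4 | OK | 75 | 415 | 330-6626 | yes | no | 0 | 166.7 | 113 | 28.34 | ... | 122 | 12.61 | 186.9 | 121 | 8.41 | 10.1 | 3 | 2.73 | 3 | False. |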


### Create a subset of data

#### Subset of a single Series

In [5]:
account_length = data["Account Length"]

In [6]:
account_length.head()

0    128
1    107
2    137
3     84
4     75
Name: Account Length, dtype: int64

In [7]:
type(account_length)

pandas.core.series.Series

In [8]:
subset = data[["Account Length", "Phone", "Eve Charge", "Day Calls"]]

In [9]:
subset.head()

| | Account Length | Phone | Eve Charge | Day Calls |
|---|---|---|---|---|
| 0 | 128 | 382-4657 | 16.78 | 110 |
| 1 | 107 | 371-7191 | 16.62 | 123 |
| 2 | 137 | 358-1921 | 10.3 | 114 |
| 3 | 84 | 375-9999 | 5.26 | 71 |
| 4 | 75 | 330-6626 | 12.61 | 113 |


In [11]:
type(subset)

pandas.core.frame.DataFrame
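
The difference between the two results above comes down to the brackets: a single label returns a Series, while a list of labels returns a DataFrame, even when the list has only one element. A small sketch on a made-up frame (the values are illustrative and `toy` is a hypothetical name, not part of the churn data):

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "Account Length": [128, 107, 137],
    "Day Calls": [110, 123, 114],
})

as_series = toy["Account Length"]    # single label -> Series
as_frame = toy[["Account Length"]]   # list of labels -> DataFrame

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
print(as_frame.shape)            # (3, 1)
```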

In [None]:
desired_columns = ["Account Length", "Phone", "Eve Charge", "Night Calls"]
subset = data[desired_columns]
subset.head()

In [None]:
desired_columns = ["Account Length", "VMail Message", "Day Calls"]
desired_columns

In [None]:
all_columns_list = data.columns.values.tolist()
all_columns_list

In [None]:
sublist = [x for x in all_columns_list if x not in desired_columns]
sublist

In [None]:
subset = data[sublist]
subset.head()
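
The list comprehension above keeps every column that is not named in `desired_columns`. As an alternative sketch, `DataFrame.drop` expresses the same exclusion directly. The frame below is a made-up stand-in for `data`, and `toy` and `unwanted` are hypothetical names chosen for the example.

```python
import pandas as pd

# Hypothetical stand-in for the churn data, with illustrative values.
toy = pd.DataFrame({
    "Account Length": [128, 107],
    "VMail Message": [25, 26],
    "Day Calls": [110, 123],
    "State": ["KS", "OH"],
})
unwanted = ["Account Length", "VMail Message", "Day Calls"]

# List-comprehension approach, as in the cells above.
kept = [col for col in toy.columns if col not in unwanted]
subset_a = toy[kept]

# Same result with drop(); the original frame is left unchanged.
subset_b = toy.drop(columns=unwanted)

print(subset_b.columns.tolist())  # ['State']
```

The comprehension is handy when the filter is more than a simple exclusion list (for example, matching a name pattern), while `drop` reads more directly when the columns to remove are already known.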

#### Subset of Rows - Slicing

Selecting a range of rows from a DataFrame is often called slicing. Row positions start at 0, and the end position of a slice is excluded.

In [None]:
data[1:25]

In [None]:
data[10:35]

In [None]:
data[:8] # equivalent to data[0:8]

In [None]:
data[3320:]
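
Row slices follow Python's usual convention: positions start at 0 and the end position is excluded. A minimal sketch on a made-up five-row frame (`toy` and its values are hypothetical) makes the boundaries easy to check:

```python
import pandas as pd

toy = pd.DataFrame({"value": [10, 20, 30, 40, 50]})  # hypothetical data

first_three = toy[:3]   # same as toy[0:3]: rows at positions 0, 1, 2
middle = toy[1:3]       # rows at positions 1 and 2; position 3 is excluded
tail = toy[3:]          # from position 3 to the end

print(first_three["value"].tolist())  # [10, 20, 30]
print(middle["value"].tolist())       # [20, 30]
print(tail["value"].tolist())         # [40, 50]
```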

#### Row Slicing with boolean conditions

In [None]:
# Selecting values with Day Mins > 300
data1 = data[data["Day Mins"]>300]
data1.shape

In [None]:
# Selecting values with State = "NY"
data2 = data[data["State"]=="NY"]
data2.shape

In [None]:
## AND -> &
data3 = data[(data["Day Mins"]>300) & (data["State"]=="NY")]
data3.shape

In [None]:
## OR -> |
data4 = data[(data["Day Mins"]>300) | (data["State"]=="NY")]
data4.shape

In [None]:
data5 = data[data["Day Calls"]< data["Night Calls"]]
data5.shape

In [None]:
data6 = data[data["Day Mins"]<data["Night Mins"]]
data6.shape
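
Each comparison above produces a boolean Series, and `&`, `|` and `~` combine those masks element by element. The parentheses matter because these operators bind more tightly than comparisons such as `>` and `==`. A small sketch on made-up rows (all names and values below are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "State": ["NY", "NY", "OH", "KS"],
    "Day Mins": [310.0, 120.5, 305.2, 99.9],
})

long_calls = toy["Day Mins"] > 300   # boolean Series
in_ny = toy["State"] == "NY"         # boolean Series

both = toy[long_calls & in_ny]       # AND: both conditions hold
either = toy[long_calls | in_ny]     # OR: at least one condition holds
not_ny = toy[~in_ny]                 # NOT: condition does not hold

print(both.shape)    # (1, 2)
print(either.shape)  # (3, 2)
print(not_ny.shape)  # (2, 2)
```

Naming the masks first, as done here, also keeps long conditions readable when several filters are combined.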

In [None]:

subset_first_50 = data[["Day Mins", "Night Mins", "Account Length"]][:50]
subset_first_50.head()

In [None]:
subset[:10]

#### Filtering with ix -> loc and iloc

In [None]:
data.iloc[1:10, 3:6] ## Rows at positions 1 to 9, columns at positions 3 to 5

In [None]:
data.iloc[:,3:6] # all rows, columns at positions 3 to 5
data.iloc[1:10,:] # all columns, rows at positions 1 to 9

In [None]:
data.iloc[1:10, [2,5,7]]  # selecting specific columns

In [None]:
data.iloc[[1,5,8,36], [2,5,7]]

In [None]:
data.loc[[1,5,8,36], ["Area Code", "VMail Plan", "Day Mins"]]
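
With the default integer index, `loc` and `iloc` can look interchangeable. The sketch below uses a made-up frame with a non-default index (an assumption for illustration only) to show that `iloc` works on positions and excludes the end of a slice, while `loc` works on labels and includes it.

```python
import pandas as pd

# Hypothetical frame with a non-default index so labels and positions differ.
toy = pd.DataFrame(
    {"Area Code": [415, 408, 510], "Day Mins": [265.1, 161.6, 243.4]},
    index=[10, 20, 30],
)

rows_by_position = toy.iloc[0:2]   # positions 0 and 1; position 2 is excluded
rows_by_label = toy.loc[10:30]     # labels 10 through 30; the end is included

print(rows_by_position.index.tolist())  # [10, 20]
print(rows_by_label.index.tolist())     # [10, 20, 30]

# The same cell addressed both ways.
print(toy.iloc[1, 1])            # 161.6
print(toy.loc[20, "Day Mins"])   # 161.6
```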

## Inserting new columns in a DataFrame

In [None]:
data["Total Mins"] = data["Day Mins"] + data["Night Mins"] + data["Eve Mins"]

In [None]:
data["Total Mins"].head()

In [None]:
data["Total Calls"] = data["Day Calls"] + data["Night Calls"] + data["Eve Calls"]

In [None]:
data["Total Calls"].head()

In [None]:
data.shape

In [None]:
data.head()
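
The cells above add columns by assigning to `data` in place. As an alternative sketch, `DataFrame.assign` returns a new frame and leaves the original untouched, which can help when the source data should stay as loaded. The frame and values below are made up for the example.

```python
import pandas as pd

# Hypothetical stand-in for the churn data, with illustrative values.
toy = pd.DataFrame({
    "Day Mins": [265.1, 161.6],
    "Eve Mins": [197.4, 195.5],
    "Night Mins": [244.7, 254.4],
})

# Column names containing spaces cannot be written as keyword arguments,
# so the new column is passed through a dictionary.
with_total = toy.assign(
    **{"Total Mins": toy[["Day Mins", "Eve Mins", "Night Mins"]].sum(axis=1)}
)

print(with_total.shape)     # (2, 4)
print("Total Mins" in toy)  # False: the original frame is unchanged
```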