### Recoding!

### Tips and things to remember when preparing for analysis:

    - Many modeling tools, such as random forest and decision tree implementations, expect numeric input, so 
      categorical data must be recoded first.
    - Use .info() to view your variable types.
    - Use .value_counts() to see what needs to be recoded inside each column.

### Remember to explore the NA values in your dataset!

    - Many analyses cannot run on data that contains NA values.
    - First assess them: how many do you have? As a rough rule of thumb, if NA values make up 20% or less of
      your data it is usually OK to drop them, but many factors need to be considered before dropping NA values.
    - Consider whether you could impute the mean or the most frequent value instead.
    - How you decide to deal with NA values will depend on the nature of your dataset and what you are trying
      to do :)
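
The NA checklist above can be sketched in a few lines of pandas. This is only an illustration under assumptions: the small `df` below is made-up toy data (not the tips dataset), and whether dropping or imputing is appropriate still depends on your own data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values (not the tips dataset).
df = pd.DataFrame({
    "total_bill": [16.99, np.nan, 21.01, 23.68, np.nan],
    "day": ["Sun", "Sat", None, "Sat", "Sat"],
})

# Share of missing values per column, as a fraction between 0 and 1.
na_share = df.isna().mean()

# Option 1: drop every row that has at least one NA.
dropped = df.dropna()

# Option 2: impute instead of dropping.
imputed = df.copy()
# Numeric column: fill with the column mean.
imputed["total_bill"] = imputed["total_bill"].fillna(imputed["total_bill"].mean())
# Categorical column: fill with the most frequent value.
imputed["day"] = imputed["day"].fillna(imputed["day"].mode()[0])

print(na_share)
print(len(dropped), "rows left after dropna()")
print(imputed)
```

Comparing `na_share` against whatever threshold you are comfortable with is one way to decide between the two options.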

In [1]:
# Watch the tutorial video on recoding that goes along with this notebook!
from IPython.display import VimeoVideo
VimeoVideo('781109741', width=720, height=480)

In [2]:
import pandas as pd
import seaborn as sns

In [3]:
tips = sns.load_dataset("tips")

In [4]:
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [5]:
tips.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   total_bill  244 non-null    float64 
 1   tip         244 non-null    float64 
 2   sex         244 non-null    category
 3   smoker      244 non-null    category
 4   day         244 non-null    category
 5   time        244 non-null    category
 6   size        244 non-null    int64   
dtypes: category(4), float64(2), int64(1)
memory usage: 7.4 KB


In [6]:
tips.day.value_counts()
# This shows which levels (options) appear in the day column of the tips dataset and how many rows have each.

Sat     87
Sun     76
Thur    62
Fri     19
Name: day, dtype: int64
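
By default, value_counts() leaves NA values out of the tally. A minimal sketch of including them, using a made-up `day` Series (an assumption for illustration, since the tips dataset has no NA values):

```python
import pandas as pd

# Hypothetical column with one missing entry.
day = pd.Series(["Sat", "Sun", "Sat", None, "Fri"])

# dropna=False adds a row for missing values to the counts.
counts = day.value_counts(dropna=False)

print(counts)
print("rows counted without NA:", day.value_counts().sum())
print("rows counted with NA:   ", counts.sum())
```

This can help when deciding which levels need a recode and whether missing values need handling first.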

In [7]:
def RECODE(value):
    # .apply() passes one value from the column at a time, so each if compares a single day label.
    if value == "Sat":
        return "1"
    if value == "Sun":
        return "2"
    if value == "Thur":
        return "3"
    if value == "Fri":
        return "4"
    # Any label not listed above falls through and returns None, which shows up as a missing value.

tips['dayR'] = tips['day'].apply(RECODE)
# This creates a recode for the day column in the tips dataset. We define a function with def, name it RECODE,
# and follow it with (value): so that pressing enter indents the next line for the if statements. It is a good
# idea to work from top to bottom through the value_counts() results so that no level is left out. Each
# condition has the form if value == "day_here": with the day in quotes followed by a colon.

# The last line creates a new column dayR in the tips dataset. The right side of the equal sign takes the day
# column, and .apply() runs RECODE on each of its values and stores the results in dayR!
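
The same recode can also be written as a dictionary lookup with .map(). The sketch below is an alternative, not the method used in this notebook; the `day` Series is a toy stand-in for tips['day'], and `day_codes` and `day_r` are names introduced only for this example.

```python
import pandas as pd

# Toy stand-in for tips['day'] (hypothetical values, same labels as above).
day = pd.Series(["Sun", "Sat", "Thur", "Fri", "Sun"])

# Same codes as RECODE, written as a lookup table with integers used directly.
day_codes = {"Sat": 1, "Sun": 2, "Thur": 3, "Fri": 4}

# .map() looks up each value of the column in the dictionary.
day_r = day.map(day_codes)

# Labels missing from the dictionary become NaN, so check before analysis.
unmapped = day_r.isna().sum()

print(day_r.tolist())
print("unmapped values:", unmapped)
```

Because the dictionary already holds integers, a separate conversion step may not be needed with this approach.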

In [8]:
tips.head()
#Now we can see our new column dayR

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,dayR
0,16.99,1.01,Female,No,Sun,Dinner,2,2
1,10.34,1.66,Male,No,Sun,Dinner,3,2
2,21.01,3.5,Male,No,Sun,Dinner,3,2
3,23.68,3.31,Male,No,Sun,Dinner,2,2
4,24.59,3.61,Female,No,Sun,Dinner,4,2


In [9]:
tips.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   total_bill  244 non-null    float64 
 1   tip         244 non-null    float64 
 2   sex         244 non-null    category
 3   smoker      244 non-null    category
 4   day         244 non-null    category
 5   time        244 non-null    category
 6   size        244 non-null    int64   
 7   dayR        244 non-null    category
dtypes: category(5), float64(2), int64(1)
memory usage: 7.8 KB


In [10]:
tips['dayR'] = tips['dayR'].astype(int)
# Convert dayR from category to integer so the column is numeric and ready for analysis

In [11]:
tips.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   total_bill  244 non-null    float64 
 1   tip         244 non-null    float64 
 2   sex         244 non-null    category
 3   smoker      244 non-null    category
 4   day         244 non-null    category
 5   time        244 non-null    category
 6   size        244 non-null    int64   
 7   dayR        244 non-null    int64   
dtypes: category(4), float64(2), int64(2)
memory usage: 9.3 KB


In [12]:
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,dayR
0,16.99,1.01,Female,No,Sun,Dinner,2,2
1,10.34,1.66,Male,No,Sun,Dinner,3,2
2,21.01,3.5,Male,No,Sun,Dinner,3,2
3,23.68,3.31,Male,No,Sun,Dinner,2,2
4,24.59,3.61,Female,No,Sun,Dinner,4,2


In [13]:
tips.isna().sum()
#Viewing the NA values in each column of the dataset

total_bill    0
tip           0
sex           0
smoker        0
day           0
time          0
size          0
dayR          0
dtype: int64

In [14]:
tips2 = tips[['total_bill', 'tip', 'size', 'dayR']]
# Creates a new DataFrame with only the selected columns
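
When there are many columns, listing the numeric ones by hand can get tedious. A minimal sketch of selecting them by type instead, using a made-up two-row `df` with the same mix of column types (an assumption for illustration):

```python
import pandas as pd

# Toy frame with the same mix of column types (hypothetical values).
df = pd.DataFrame({
    "total_bill": [16.99, 10.34],
    "sex": pd.Categorical(["Female", "Male"]),
    "size": [2, 3],
    "dayR": [2, 2],
})

# Keep only numeric columns (ints and floats); category columns are left out.
numeric_only = df.select_dtypes(include="number")

print(list(numeric_only.columns))
```

Listing columns explicitly, as in the cell above, is still the clearer choice when you only want a specific subset.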

In [15]:
tips2.head()

Unnamed: 0,total_bill,tip,size,dayR
0,16.99,1.01,2,2
1,10.34,1.66,3,2
2,21.01,3.5,3,2
3,23.68,3.31,2,2
4,24.59,3.61,4,2


In [16]:
#Hope this was helpful for you! <3 Mia