Often the data you are using will not have balanced classes (e.g., 9K examples of valid email and 1K examples of spam). This is a problem: your model can just predict valid email every time and still be 90% accurate. Woohoo? Nope.
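To make the arithmetic concrete, here is a minimal sketch of that "always predict valid email" model. The labels are made up to mirror the 9K/1K split above (0 = valid, 1 = spam), and spam recall is added as an assumption about which metric exposes the problem.

```python
# Hypothetical labels mirroring the example: 9,000 valid emails (0) and 1,000 spam (1)
y_true = [0] * 9000 + [1] * 1000

# A "model" that always predicts the majority class (valid email)
y_pred = [0] * len(y_true)

# Accuracy: fraction of predictions that match the true label
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Spam recall: fraction of the actual spam that the model catches
spam_total = sum(1 for t in y_true if t == 1)
spam_caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
spam_recall = spam_caught / spam_total

print(f"accuracy:    {accuracy:.2f}")     # 0.90
print(f"spam recall: {spam_recall:.2f}")  # 0.00
```

The accuracy looks fine, but the model never flags a single spam message.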

You'll need to consider rebalancing your classes (i.e., taking a 1K sample of valid email, or sampling with replacement to create 9K examples of spam).

Below is a process that will help expedite rebalancing your classes.

## Make fake data

In [2]:
import pandas as pd

In [1]:
#create some fake data
ones = [1 for x in range(20)]
twos = [2 for x in range(13)]
threes = [3 for x in range(7)]
review_ratings = {'ratings': ones + twos + threes}

In [3]:
#make df out of data
df = pd.DataFrame.from_dict(review_ratings)

In [7]:
#look at data
df.head()

   ratings
0        1
1        1
2        1
3        1
4        1


In [8]:
#get counts by rating
df['ratings'].value_counts()

1    20
2    13
3     7
Name: ratings, dtype: int64
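The smallest of these counts is the natural target size when undersampling. As a small sketch, it can be read off `value_counts()` instead of being typed in by hand. The names `counts`, `n_min`, `n_max`, and `minority` are illustrative, not part of the original notebook.

```python
import pandas as pd

# Same fake data as above: 20 ones, 13 twos, 7 threes
df = pd.DataFrame({'ratings': [1] * 20 + [2] * 13 + [3] * 7})

counts = df['ratings'].value_counts()
n_min = counts.min()        # size of the smallest class
n_max = counts.max()        # size of the largest class
minority = counts.idxmin()  # label of the smallest class

print(n_min, n_max, minority)  # 7 20 3
```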

## Rebalance Classes

In [9]:
#sample to create balanced data
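def df_sample_replace(df, column, target, n, replace=True):
    """Return n rows sampled from the rows of df where df[column] == target."""
    #replace=True lets n exceed the number of matching rows (oversampling)
    subset = df[df[column] == target]
    return subset.sample(n=n, replace=replace)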

In [11]:
#downsample the two larger classes to the size of the smallest class (7)
a = df_sample_replace(df, 'ratings', 1, 7, replace=False)
b = df_sample_replace(df, 'ratings', 2, 7, replace=False)

In [13]:
#create new df of balanced classes (the smallest class, 3, is kept whole)
c = df[df['ratings'] == 3]
df_balanced = pd.concat([a,b,c])
df_balanced['ratings'].value_counts()

3    7
2    7
1    7
Name: ratings, dtype: int64
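The cells above undersample the larger classes. The other option mentioned at the top, sampling with replacement to grow the smaller classes, might look like the sketch below. It rebuilds the same fake data so it runs on its own. The loop over `groupby`, the names `parts` and `df_oversampled`, and `random_state=0` (used only to make the draw repeatable) are assumptions for illustration.

```python
import pandas as pd

# Same fake data as above: 20 ones, 13 twos, 7 threes
df = pd.DataFrame({'ratings': [1] * 20 + [2] * 13 + [3] * 7})

# Target size is the largest class (20 rows)
n_max = df['ratings'].value_counts().max()

parts = []
for rating, group in df.groupby('ratings'):
    if len(group) < n_max:
        # replace=True allows drawing more rows than the group contains
        group = group.sample(n=n_max, replace=True, random_state=0)
    parts.append(group)

df_oversampled = pd.concat(parts)
print(df_oversampled['ratings'].value_counts())
```

Every class now has 20 rows, but the extra rows for ratings 2 and 3 are duplicates of existing ones. If you hold out a test set, it is generally safer to split first and oversample only the training portion, so copies of the same row do not land on both sides.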