<a href="https://colab.research.google.com/github/rileythejones/DS-Unit-2-Applied-Modeling/blob/master/Applied_Modeling_Day_One_Yelp_Dataset_Riley_Jones.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 3, Module 1*

---


# Define ML problems

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your decisions.

- [ ] Choose your target. Which column in your tabular dataset will you predict?
- [ ] Is your problem regression or classification?
- [ ] How is your target distributed?
    - Classification: How many classes? Are the classes imbalanced?
    - Regression: Is the target right-skewed? If so, you may want to log transform the target.
- [ ] Choose which observations you will use to train, validate, and test your model.
    - Are some observations outliers? Will you exclude them?
    - Will you do a random split or a time-based split?
- [ ] Choose your evaluation metric(s).
    - Classification: Is your majority class frequency between 50% and 70%? If so, accuracy alone can be a reasonable metric. Outside that range, accuracy could be misleading. What evaluation metric will you choose, in addition to or instead of accuracy?
- [ ] Begin to clean and explore your data.
- [ ] Begin to choose which features, if any, to exclude. Would some features "leak" future information?
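
The target-distribution questions above can be checked with a few lines of pandas. This is a minimal sketch on made-up toy data, not the Yelp sample; `describe_target`, `toy_classes`, and `toy_counts` are hypothetical names introduced only for illustration.

```python
import numpy as np
import pandas as pd


def describe_target(y: pd.Series) -> dict:
    """Summarize a classification target: class count and majority class share."""
    # value_counts sorts by frequency, so the first entry is the majority class
    shares = y.value_counts(normalize=True)
    majority_share = float(shares.iloc[0])
    return {
        "n_classes": int(shares.size),
        "majority_class": shares.index[0],
        "majority_share": majority_share,
        # Rule of thumb from the checklist: accuracy is reasonable between 50% and 70%
        "accuracy_ok": 0.5 < majority_share < 0.7,
    }


# Hypothetical toy classification target (not the Yelp data)
toy_classes = pd.Series([4, 4, 4, 3, 3, 5, 4, 4, 2, 4])
summary = describe_target(toy_classes)
print(summary)

# Hypothetical right-skewed regression target: compare skew before and after log1p
toy_counts = pd.Series([1, 2, 2, 3, 5, 8, 15, 36, 120, 6407])
raw_skew = toy_counts.skew()
log_skew = np.log1p(toy_counts).skew()
print(raw_skew, log_skew)
```

`np.log1p` is used rather than `np.log` so that a target containing zeros would not produce `-inf`.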

# Exploration 

In [0]:
import pandas as pd
import pandas_profiling

# Read a sample of Yelp users (2004-2018) from the Southeast US and Canada.
# The sample holds 30,000 of the 1.3 million users.
df = pd.read_csv('yelp30kuser.csv')

# Get Pandas Profiling Report
# df.profile_report()

In [14]:
import plotly.express as px
# What's been commented out below is mostly notes to myself
# df.describe(exclude='number')
# df.describe(include='number')
# df['average_stars'].describe()
# pd.qcut(df['average_stars'], q = 10, duplicates='drop').value_counts(normalize=False)
# df['average_stars'].value_counts(normalize=False)
# df['average_stars'].nunique()

# Note to self (not executed):
# random = df['category'].str.contains('thing')
# df.loc[random, 'category'] = 'Random Things'

# Drop high-cardinality categoricals; you can add them back in later

In [0]:
def engineer(X):
    """A function to engineer the training, validation and test datasets in the same way"""
    # Make a copy so as not to modify the original dataset
    X = X.copy()
    
    # Format this column into a datetime type to extract year, month, and day 
    X['yelping_since'] = pd.to_datetime(X['yelping_since'])
    X['user_created_year'] = X['yelping_since'].dt.year
    X['user_created_month'] = X['yelping_since'].dt.month
    X['user_created_day'] = X['yelping_since'].dt.day
    X = X.drop(columns='yelping_since')  # drop the original column
    
    # Truncate the float rating to an integer class label, since
    # classifiers need discrete targets rather than continuous floats
    X['target_star'] = X['average_stars'].astype(int)
    
    # X['review_count_bin'] = pd.qcut(X['review_count'], q=10)
    
    # There are no spaces in the column names, but this might be useful anyway
    X.columns = [col.replace(' ', '_') for col in X]
    
    return X

In [0]:
# Engineer the data to work with plots 
df = engineer(df)

In [17]:
px.histogram(df, x='average_stars', color='user_created_year',
             range_x=[0.9, 5.1], nbins=50,
             title="A User's Average Star Rating from 2004 to 2018")


In [18]:
# Users with four or fewer reviews
low = df.query('review_count <= 4')
px.histogram(low, x='average_stars', color='user_created_year',
             range_x=[0.9, 5.1], nbins=50,
             title="Average Star Rating for Users with 4 or Fewer Reviews")


In [20]:
px.histogram(df, x='average_stars', range_x=[0.9, 5.1], nbins=50,
             title="A User's Average Star Rating")


In [22]:
fig = px.histogram(df, x="review_count", color="target_star", range_x=[0, 100])
fig.show()

In [0]:
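import seaborn as sns

# distplot is deprecated in recent seaborn releases;
# histplot with a KDE overlay is the replacement
y = df['average_stars']
sns.histplot(y, kde=True);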

In [23]:
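# Split the dataframe into training and validation sets
from sklearn.model_selection import train_test_split
# random_state is fixed so the split is reproducible (42 matches the model seed used later)
training, validation = train_test_split(df, test_size=0.1, shuffle=True, random_state=42)
training.shape, validation.shape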

((27000, 26), (3000, 26))
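
The split above is random. Since the checklist also asks about a time-based split, here is a minimal sketch of that alternative. It assumes a year column like the `user_created_year` produced by `engineer`; `time_based_split`, the toy frame, and the 2016 cutoff are hypothetical choices for illustration, not a decision made in this notebook.

```python
import pandas as pd


def time_based_split(frame: pd.DataFrame, year_col: str, cutoff_year: int):
    """Train on rows up to and including cutoff_year, validate on later rows."""
    train = frame[frame[year_col] <= cutoff_year]
    val = frame[frame[year_col] > cutoff_year]
    return train, val


# Hypothetical toy frame standing in for the engineered Yelp users
toy = pd.DataFrame({
    "user_created_year": [2004, 2010, 2015, 2016, 2017, 2018],
    "review_count": [120, 36, 8, 5, 2, 1],
})

toy_train, toy_val = time_based_split(toy, "user_created_year", cutoff_year=2016)
print(toy_train.shape, toy_val.shape)
```

With this kind of split, every validation row comes from a later year than every training row, which is closer to how a model would be used on future users.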

In [25]:
# Separate X and y. `df` was already passed through engineer() before the
# split, so calling engineer() again would raise a KeyError on the dropped
# 'yelping_since' column.
train = training.copy()
val = validation.copy()

target = 'review_count'

X_train = train.drop(columns=target)
y_train = train[target]

X_val = val.drop(columns=target)
y_val = val[target]

In [0]:
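# Making a Pipeline for Model Selection
import category_encoders as ce
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy='median'),
    RandomForestClassifier(n_estimators=50, random_state=42)
)

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_val)
print('Validation Accuracy', accuracy_score(y_val, y_pred))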
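
A validation accuracy is easier to read next to a baseline. The sketch below computes the accuracy of always predicting the most frequent training class. It uses made-up toy labels rather than the Yelp target, and `majority_baseline_accuracy` is a hypothetical helper name.

```python
import pandas as pd


def majority_baseline_accuracy(y_train: pd.Series, y_val: pd.Series) -> float:
    """Accuracy from always predicting the most frequent training class."""
    majority_class = y_train.mode().iloc[0]
    return float((y_val == majority_class).mean())


# Hypothetical toy labels (not the Yelp data)
toy_y_train = pd.Series([1, 1, 1, 2, 3, 1, 2])
toy_y_val = pd.Series([1, 2, 1, 5])

baseline = majority_baseline_accuracy(toy_y_train, toy_y_val)
print('Baseline Accuracy', baseline)
```

A model whose validation accuracy does not beat this number has learned little beyond the class frequencies.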

In [26]:
df['review_bin'] = pd.qcut(df['review_count'], q=8, duplicates='drop')
df['review_bin'][:10]


0      (3.0, 5.0]
1      (3.0, 5.0]
2    (15.0, 36.0]
3    (15.0, 36.0]
4      (5.0, 8.0]
5      (2.0, 3.0]
6    (0.999, 2.0]
7      (3.0, 5.0]
8    (0.999, 2.0]
9     (8.0, 15.0]
Name: review_bin, dtype: category
Categories (7, interval[float64]): [(0.999, 2.0] < (2.0, 3.0] < (3.0, 5.0] < (5.0, 8.0] < (8.0, 15.0] <
                                    (15.0, 36.0] < (36.0, 6407.0]]
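
The output above shows 7 categories even though `q=8` was requested. With `duplicates='drop'`, `pd.qcut` discards repeated quantile edges, which presumably happened here because many users share the same low review counts. The sketch below reproduces the effect on a made-up toy series; the values are hypothetical and not taken from the Yelp sample.

```python
import pandas as pd

# Hypothetical toy counts with many ties at the low end
toy_counts = pd.Series([1, 1, 1, 1, 2, 3, 5, 8, 15, 36])

# Quartile edges: the 0% and 25% quantiles are both 1, so one edge repeats
edges = toy_counts.quantile([0, 0.25, 0.5, 0.75, 1.0])
print(edges.tolist())

# q=4 is requested, but the duplicate edge is dropped, leaving fewer bins
bins = pd.qcut(toy_counts, q=4, duplicates='drop')
print(bins.cat.categories)
print(len(bins.cat.categories))
```

Without `duplicates='drop'`, the same call would raise a `ValueError` about non-unique bin edges.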

In [27]:
px.histogram(df, x='average_stars',
             range_x=[0.9, 5.1], range_y=[0, 1200], 
             animation_frame='user_created_year',
             category_orders={"user_created_year":
                              [2004, 2005, 2006, 2007, 2008,
                               2009, 2010, 2011, 2012, 2013,
                               2014, 2015, 2016, 2017, 2018]},
             )


