This is a comprehensive data science project for the Final of WISERCLUB 2019-2020.
The project is about Business Analytics and Data Mining. It consists of three parts:
Part 1: Exploratory Data Analysis
Part 2: Data Preprocessing
Part 3: Model Training and Prediction
Each part has several problems (you can see them in the Contents). We are given two csv files, named data.csv
and holiday.csv, derived from a new retail specialty coffee operator. The task is to use data and models to uncover
hidden information.
For SECURITY reasons, the .csv files will not be uploaded.
Libraries: pandas, numpy, matplotlib, scipy, math, datetime, sklearn, xgboost, imblearn
Techniques: Aggregate Functions (groupby in Pandas), Hypothesis Testing (T test, F test), String Format, Lambda Expression,
Adaboost, Random Forest, Cross Validation, Xgboost, GridSearchCV, Oversampling, Undersampling
Part 1: Exploratory Data Analysis

- Find the time span of the order data.
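The time span can be read off the minimum and maximum timestamps. A minimal sketch on synthetic rows, since data.csv is not uploaded; the column names `Phone_No` and `order_time` are assumptions and may differ in the real file:

```python
import pandas as pd

# Three synthetic orders standing in for data.csv (not uploaded);
# the timestamp column name (order_time) is an assumption.
df = pd.DataFrame({
    "Phone_No": ["u1", "u2", "u3"],
    "order_time": ["2019-02-01 08:30:00", "2019-02-15 12:00:00", "2019-03-10 18:45:00"],
})
df["order_time"] = pd.to_datetime(df["order_time"])

# Earliest and latest orders bound the time span
start, end = df["order_time"].min(), df["order_time"].max()
span_days = (end - start).days
print(f"from {start.date()} to {end.date()} ({span_days} days)")
```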
- Find the number of orders each day.
  a. Boss: we need to design two different strategies for sales on workdays and sales on weekends.
     True or False? Explain.
- Find the number of users.
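The daily order counts come from a groupby, and the boss's claim can be checked with a two-sample t test (the hypothesis testing listed among the techniques above). A sketch on a synthetic two-week log; column names are assumptions:

```python
import pandas as pd
from scipy import stats

# Synthetic two-week order log; the column names (Phone_No, order_time)
# are assumptions, the real data.csv may use different ones.
rows = []
for day in pd.date_range("2019-02-01", "2019-02-14"):
    n = (8 if day.dayofweek >= 5 else 5) + day.day % 2   # busier weekends
    rows.extend({"Phone_No": f"user{i}", "order_time": day} for i in range(n))
orders = pd.DataFrame(rows)

# Number of orders each day
daily = orders.groupby(orders["order_time"].dt.date).size()

# Number of users
n_users = orders["Phone_No"].nunique()

# Boss's claim (sub-problem a): compare weekday vs weekend daily order counts.
# A small p-value supports using two different sales strategies.
is_weekend = pd.to_datetime(daily.index).dayofweek >= 5
t_stat, p_value = stats.ttest_ind(daily[is_weekend], daily[~is_weekend], equal_var=False)
```

Welch's test (`equal_var=False`) is used here because the two groups need not share a variance; on the real data an F test can check that assumption first.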
- Find the ten commodities with the highest sales and draw a graph with the commodity name on the x-axis
  and the number of orders on the y-axis.
- Find the discount rate of each order and concatenate it onto the original dataset under the column name discount_rate.
  You may use pay_money, coffeestore_share_money, commodity_origin_money and commodity_income.
- Find the average discount of each week. One week should run from Sunday to Saturday.
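One plausible definition, assuming `discount_rate = pay_money / commodity_origin_money` (the exact formula is left to the reader and may instead involve commodity_income). For the weekly average, pandas' `W-SAT` resampling rule closes each weekly bin on Saturday, which yields exactly the Sunday-to-Saturday weeks the problem asks for:

```python
import pandas as pd

# Synthetic orders; the discount formula below is an assumption:
# discount_rate = pay_money / commodity_origin_money.
df = pd.DataFrame({
    "order_time": pd.to_datetime(["2019-02-03", "2019-02-06", "2019-02-09", "2019-02-10"]),
    "pay_money": [18.0, 30.0, 24.0, 15.0],
    "commodity_origin_money": [36.0, 30.0, 32.0, 30.0],
})
df["discount_rate"] = df["pay_money"] / df["commodity_origin_money"]

# 'W-SAT' bins close on Saturday, so each bin covers Sunday..Saturday
weekly = df.set_index("order_time")["discount_rate"].resample("W-SAT").mean()
```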
- Find the Retention Rate of any five days, i.e. the ratio of users who purchase again on the next day.
  For example, to compute the Retention Rate on 2019-02-10, you need to find users who bought goods
  on both 02-09 and 02-10.
- Find the Week Retention Rate of any day: find users buying on that day and buying again
  within the next seven days.
- Find the Week Retention Rate of any day for new users: find users buying on that day
  for the first time and buying again within the next seven days.
- Find the Retention Rate WITHIN one week of new users. You may choose any week you want, but it must run
  from Sunday to Saturday. You need to find users buying their first product and buying again within that week.
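All of the retention variants reduce to intersecting sets of buyers. A sketch of the next-day and seven-day versions on a synthetic log (column names are assumptions; whether the rate is anchored on the first or the second day is a convention choice, here it is the first):

```python
import pandas as pd

# Synthetic purchase log; Phone_No identifies a user (an assumption).
log = pd.DataFrame({
    "Phone_No": ["a", "b", "c", "a", "c", "b", "a"],
    "order_time": pd.to_datetime([
        "2019-02-09", "2019-02-09", "2019-02-09",
        "2019-02-10", "2019-02-12", "2019-02-14", "2019-02-20",
    ]),
})

def buyers_between(start, end):
    """Set of users who bought in [start, end], inclusive, by calendar day."""
    t = log["order_time"].dt.normalize()
    return set(log.loc[(t >= pd.Timestamp(start)) & (t <= pd.Timestamp(end)), "Phone_No"])

def next_day_retention(day):
    """Share of day-d buyers who buy again on day d+1."""
    base = buyers_between(day, day)
    nxt = buyers_between(pd.Timestamp(day) + pd.Timedelta(days=1),
                         pd.Timestamp(day) + pd.Timedelta(days=1))
    return len(base & nxt) / len(base) if base else 0.0

def week_retention(day):
    """Share of day-d buyers who buy again within the next seven days."""
    base = buyers_between(day, day)
    later = buyers_between(pd.Timestamp(day) + pd.Timedelta(days=1),
                           pd.Timestamp(day) + pd.Timedelta(days=7))
    return len(base & later) / len(base) if base else 0.0
```

For the new-user variants, restrict `base` to users whose first-ever order falls on the chosen day before intersecting.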
- Find the “Active Users” (users whose number of orders is greater than or equal to 5).
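Active users fall out of a groupby count plus a threshold filter; the 5-order cutoff is from the problem statement, the column name is an assumption:

```python
import pandas as pd

# Synthetic order rows; users with >= 5 orders count as "Active Users".
orders = pd.DataFrame({"Phone_No": ["a"] * 6 + ["b"] * 2 + ["c"] * 5})

per_user = orders.groupby("Phone_No").size().rename("n_orders")
active = per_user[per_user >= 5]

# For the next problem, this table could be written out with:
# active.reset_index().to_csv("ActiveUser.csv", index=False)
```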
- Write the Active-Users table you got in the previous problem as a csv file with filename ActiveUser.csv.
- Provide a description of the number of orders for each active user (# of active users, mean, range, std, variance,
  skewness and kurtosis).
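Pandas provides all of the requested statistics directly on a Series; a sketch with made-up order counts (note pandas uses the sample std/variance with ddof=1 and reports excess kurtosis):

```python
import pandas as pd

# Synthetic order counts per active user.
n_orders = pd.Series([6, 5, 9, 12, 7], dtype=float)

desc = {
    "count": int(n_orders.count()),
    "mean": n_orders.mean(),
    "range": n_orders.max() - n_orders.min(),
    "std": n_orders.std(),        # sample std (ddof=1)
    "variance": n_orders.var(),   # sample variance
    "skewness": n_orders.skew(),  # adjusted Fisher-Pearson coefficient
    "kurtosis": n_orders.kurt(),  # excess kurtosis (normal -> 0)
}
```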
Part 2: Data Preprocessing

- Remove the first column of the data in data.csv, because it is just a copy of the index.
- Boss: To implement Collaborative Filtering in recommendation systems, we need a user-item table that shows
  the number of orders for each user and each item.
  Try to construct the user-item table. An example of a user-item pair: (Phone_No, 标准美式).
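`pd.crosstab` builds exactly this table: one row per user, one column per item, cells holding order counts. A sketch on synthetic order lines; `commodity_name` is an assumed column name:

```python
import pandas as pd

# Synthetic order lines; commodity_name is an assumed column name.
orders = pd.DataFrame({
    "Phone_No": ["131", "131", "132", "132", "131"],
    "commodity_name": ["标准美式", "拿铁", "标准美式", "标准美式", "标准美式"],
})

# Count orders per (user, item) pair -> the user-item table
user_item = pd.crosstab(orders["Phone_No"], orders["commodity_name"])
```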
- Boss: Life is not like a Markov Chain, which means everyone's past behavior is correlated with their present one.
  That is why we can exploit past purchase behavior to predict future buying trends.
  Try to construct a dataset showing this past purchasing behavior. For convenience, several instructions are
  proposed as follows:
  a. Two days correspond to one dimension.
  b. The last two days of the data's time span are the future, i.e. they correspond to the target field for the
     following data mining models.
  c. The length of each user vector must be maximized.
  d. The dataset should be a DataFrame in Pandas, so you can customize the columns as you wish.
  For example, if the time span is from 2019-02-01 to 2019-02-10, there are 10 days altogether, so each user
  corresponds to a 5-dimensional vector with 4 feature dimensions and 1 target dimension. The vector [4, 0, 0, 0, 1]
  means this user bought four goods between 02-01 and 02-02 and one good between 02-09 and 02-10. By rule c,
  the length of each user vector MUST BE 5.
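The binning can be done by integer-dividing each order's day offset by 2 and cross-tabulating against the user. The sketch below reproduces the worked example above ([4, 0, 0, 0, 1] for a user with four orders on 02-01/02-02 and one on 02-10); column names are assumptions:

```python
import pandas as pd

# Ten days, 2019-02-01..2019-02-10, binned into five 2-day dimensions;
# the last bin is the prediction target (rule b).
log = pd.DataFrame({
    "Phone_No": ["a", "a", "a", "a", "a", "b"],
    "order_time": pd.to_datetime([
        "2019-02-01", "2019-02-01", "2019-02-02", "2019-02-02", "2019-02-10",
        "2019-02-05",
    ]),
})

start = log["order_time"].min().normalize()
# 0-based day offset // 2 -> one dimension per two days (rule a)
bin_idx = (log["order_time"].dt.normalize() - start).dt.days // 2
table = pd.crosstab(log["Phone_No"], bin_idx)

# Ensure all five bins exist even if empty, keeping every vector maximal (rule c)
table = table.reindex(columns=range(5), fill_value=0)
table.columns = [f"d{i}" for i in range(4)] + ["target"]
```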
Part 3: Model Training and Prediction

- Transform the data you got from the last section into a Numpy array.
- Split the data into features X and targets Y.
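Both steps are one-liners once the target sits in the last column, as in the dataset built in Part 2:

```python
import pandas as pd

# A feature table shaped like the one from Part 2: 4 feature bins + 1 target column.
table = pd.DataFrame([[4, 0, 0, 0, 1], [0, 0, 1, 0, 0]],
                     columns=["d0", "d1", "d2", "d3", "target"])

arr = table.to_numpy()            # DataFrame -> Numpy array
X, Y = arr[:, :-1], arr[:, -1]    # last column is the target
```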
- Use Adaboost and Random Forest in Sklearn to construct models for prediction with 3-fold cross validation.
  a. (Optional) Use Xgboost.
  b. Boss: Please do not use Naive Bayes or Support Vector Machine in this project.
     True or False? Explain.
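A minimal cross-validation sketch with both required models; the features below are a synthetic stand-in for the real X and Y (which come from the Part 2 dataset), and F1 is used as the score in anticipation of the imbalance discussed in the next problem:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in features shaped like the 2-day purchase bins.
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(60, 4)).astype(float)
y = (X[:, 0] + X[:, 1] > 5).astype(int)   # synthetic target with learnable signal

for model in (AdaBoostClassifier(n_estimators=50, random_state=0),
              RandomForestClassifier(n_estimators=100, random_state=0)):
    scores = cross_val_score(model, X, y, cv=3, scoring="f1")  # 3-fold CV, F1
    print(type(model).__name__, scores.mean().round(3))
```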
- Tune your model and report the best metrics you can get, together with the corresponding confusion matrix and
  model name. At least Adaboost and Random Forest should be tuned. Here are some suggestions.
  a. Try oversampling or undersampling: this is an imbalanced classification problem.
  b. Change the parameters of each model (e.g. scale_pos_weight in Xgboost and the probability threshold); more
     information can be found in the official documentation.
  c. Accuracy is not a suitable evaluation metric in this case. Use the F1-measure.
  d. Try not to record the # of orders for each user; record whether he bought the goods instead, 1 if he bought
     and 0 otherwise.
  e. Try to record an active-user feature. Many users bought more than one cup of drink during two days, so
     whether a user is active should be taken into consideration.
  f. Try to split the data into Workdays and Weekends and train two different models. If that is the best choice,
     report two metrics, one for the Workdays model and the other for the Weekends model.
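A minimal sketch of suggestion a and metric c, using plain numpy rather than the imblearn/sklearn helpers: random oversampling duplicates minority-class rows until the classes balance (this is what imblearn's RandomOverSampler does), and F1 is computed from the confusion-matrix counts:

```python
import numpy as np

rng = np.random.default_rng(0)

def oversample(X, y):
    """Duplicate random minority-class rows until both classes are equal-sized."""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    idx = np.flatnonzero(y == minority)
    extra = rng.choice(idx, size=counts.max() - counts.min(), replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]

def f1_score(y_true, y_pred):
    """F1 for the positive class: 2*TP / (2*TP + FP + FN)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Imbalanced toy data: 8 negatives, 2 positives.
X = np.arange(20).reshape(10, 2).astype(float)
y = np.array([0] * 8 + [1] * 2)
X_res, y_res = oversample(X, y)
```

Note that oversampling must be applied only to the training folds, never to the validation data, or the reported F1 will be optimistic.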
- After tuning, try to explain why your model works better.