## Background / Motivation

Alcohol consumption has major implications for policymakers, business leaders, and consumers. While alcohol is part of everyday social life around the world, its abuse has large societal consequences, contributing to 3 million deaths each year and to 5.1% of the global burden of disease [WHO]. At the same time, alcohol represents a large opportunity for businesses and investors, who have to reconcile its harmful effects with its consistent and growing demand worldwide. The market for alcoholic beverages is estimated at roughly $1.7 trillion and has been growing by over 5% annually post-pandemic; growth is especially high in countries like China and India as per-capita consumption and interest in premium options increase [Statista]. Because of this large and multifaceted impact, institutions across society want to better understand the drivers of alcohol consumption.

Through the exploration of demographic, cultural, and political data, we seek to build a regression model that can predict alcohol consumption in every country worldwide in a target year. Alcohol consumption levels are influenced by factors including economic development, culture, availability of alcohol, and the implementation and enforcement of alcohol policies [WHO]. Beyond these, variables related to alcohol use, such as health outcomes affected by alcohol or gender differences in consumption, may help predict alcohol consumption in different contexts worldwide. Such a comprehensive model will not only generate consumption predictions for each country, but also allow stakeholders to target interventions by explaining the relationships between various factors and alcohol consumption.


## Problem statement 

Our regression model predicts per-capita alcohol consumption in every country, using cultural and demographic data from each country as predictors. We use data from three years for which we have worldwide alcohol consumption data (2005, 2010, and 2015) and fit a separate model on each year. Our primary aim is prediction: accurately predicting consumption in 2019. A secondary goal is inference: understanding how each predictor affects alcohol consumption. We compare the predicted and actual values of alcohol consumption in 2019 to determine which of the three year-specific regressions best models worldwide alcohol consumption in 2019.
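
The comparison described above can be sketched as follows. This is a minimal illustration, not our actual pipeline: the single predictor, the numbers, and the helper names `fit_ols` and `mean_abs_error` are all hypothetical, and the data are synthetic values constructed so that the relationship drifts slowly over time.

```python
import numpy as np

def fit_ols(x, y):
    """Fit y = b0 + b1 * x by least squares and return (b0, b1)."""
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def mean_abs_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return float(np.mean(np.abs(y_true - y_pred)))

# Hypothetical predictor (e.g. median age) for five made-up countries.
# For simplicity, the same predictor values are reused in every year.
median_age = np.array([20.0, 25.0, 30.0, 35.0, 40.0])

# Synthetic consumption (liters per capita) for each training year.
train = {
    2005: 1.0 + 0.10 * median_age,
    2010: 1.2 + 0.11 * median_age,
    2015: 1.4 + 0.12 * median_age,
}
# Synthetic "actual" 2019 consumption used only for evaluation.
actual_2019 = 1.5 + 0.125 * median_age

# Fit one model per training year, then score each on the 2019 values.
mae_by_year = {}
for year, consumption in train.items():
    b0, b1 = fit_ols(median_age, consumption)
    predicted_2019 = b0 + b1 * median_age
    mae_by_year[year] = mean_abs_error(actual_2019, predicted_2019)

best_year = min(mae_by_year, key=mae_by_year.get)
print(mae_by_year, best_year)
```

In this constructed example the 2015 model has the lowest 2019 error simply because the synthetic relationship was built to drift steadily; real data need not behave this way.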

## Data sources
The data for our model and analysis come from Our World in Data (https://ourworldindata.org). This website provides yearly data from 1999-2019 on average per-capita alcohol consumption, in liters, for every country in the world. We used the same website to obtain data for all of our predictors.

## Stakeholders
Our data analysis has the potential to inform stakeholders across all sides of the business world as well as policy and research. 

In business and industry, key stakeholders are manufacturers and distributors of alcoholic products, the marketing agencies and consultancies that serve them, and individual investors and holding groups with investments in the alcohol industry. Quantitative insights from our analysis, such as predicted future alcohol consumption, can help manufacturers plan production and allocate resources and funding appropriately, while retailers and vendors can improve their inventory planning and pricing decisions. Marketing agencies and consultancies can use the qualitative conclusions and recommendations from our research, which indicate the characteristics and demographics of groups with higher alcohol consumption, to optimize their consumer targeting and develop more robust strategies for their clients, which could be alcohol-producing companies. For investors in the alcohol industry, our predictions of future alcohol consumption can inform prospective investment decisions and clarify the current stability and state of the market.

In policy-making, our predictions and qualitative insights can help governments, regulators, and other bodies with decision-making power in public health policy. Because high levels of alcohol consumption are a global public health issue, predictions of future alcohol consumption can inform policy-makers across the world, provide grounds for new alcohol regulations where needed, and help identify where to focus public health campaigns.

Lastly, in research, our data analysis can help individual researchers and research institutions working in areas such as the effects of alcohol consumption on health and society to identify new intersections, potentially opening new areas of research or improving current research projects.


## Data quality check / cleaning / preparation 

In a tabular form, show the distribution of values of each variable used in the analysis - for both categorical and continuous variables. Distribution of a categorical variable must include the number of missing values, the number of unique values, the frequency of all its levels. If a categorical variable has too many levels, you may just include the counts of the top 3-5 levels. 

If the tables in this section take too much space, you may put them in the appendix, and just mention any useful insights you obtained from the data quality check that helped you develop the model or helped you realize the necessary data cleaning / preparation.

Were there any potentially incorrect values of variables that required cleaning? If yes, how did you clean them? 

Did you do any data wrangling or data preparation before the data was ready to use for model development? Did you create any new predictors from existing predictors? For example, if you have number of transactions and spend in a credit card dataset, you may create spend per transaction for predicting if a customer pays their credit card bill. Mention the steps at a broad level, you may put minor details in the appendix. Only mention the steps that ended up being useful towards developing your final model(s).

## Exploratory data analysis

Put the relevant EDA here (visualizations, tables, etc.) that helped you figure out useful predictors for developing the model(s). Only put the EDA that ended up being useful towards developing your final model(s). 

List the insights (as bullet points) you got from EDA that ended up being useful towards developing your final model. 

Again, if there are too many plots / tables, you may put them into appendix, and just mention the insights you got from them.

## Approach

We used a linear regression model to predict alcohol consumption for every country in the world. We chose to optimize mean absolute error (MAE) because we want to weigh all errors equally rather than penalize larger errors more heavily: mispredicting alcohol consumption does not present any inherent danger.

We first performed EDA for every predictor against the response to find the best interactions and transformations, fitting candidate models and using trial and error to see which worked best. We then eliminated irrelevant predictors and carried the rest into the model improvement techniques.

We anticipated that both compiling a set of relevant predictors and eliminating predictors would be difficult. To find predictors, we used a combination of extensive research and intuition; even so, we may have missed significant or relevant predictors, since many more most likely exist. We also anticipated that finding variable interactions and transformations would be a challenge, since the number of possible combinations is large.

Our first model, which combined all the predictors with the transformations and interactions that we assumed were sufficient from EDA alone, did not fit well. We then used model improvement techniques such as best subset selection and forward/backward stepwise selection, which helped us eliminate variables and identify the ones that would optimize our model. To our knowledge, no existing solution to our problem was available online.
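
To illustrate why MAE treats all errors equally while a squared-error metric does not, here is a small sketch with made-up numbers. The two "models" and their predictions are hypothetical and chosen only to make the contrast visible.

```python
import math

def mae(actual, predicted):
    """Mean absolute error: every liter of error counts the same."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error: squaring weights large errors more heavily."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))

# Hypothetical per-capita consumption (liters) for four countries.
actual = [2.0, 6.0, 9.0, 12.0]

# Model A misses every country by 2 liters.
pred_a = [4.0, 8.0, 11.0, 14.0]
# Model B is exact for three countries and misses one by 8 liters.
pred_b = [2.0, 6.0, 9.0, 20.0]

print(mae(actual, pred_a), mae(actual, pred_b))    # 2.0 2.0
print(rmse(actual, pred_a), rmse(actual, pred_b))  # 2.0 4.0
```

Both models have the same MAE, but RMSE ranks Model B as twice as bad because of its single large miss. Optimizing MAE reflects the choice not to single out large errors.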

## Developing the model


To develop the model, we first plotted each predictor against the response and used these visualizations to see which transformations were necessary. To check that the identified transformations were adequate, we plotted residuals against fitted values to assess the linearity assumption. We then fit models on the transformed predictors to see which transformations yielded the highest R-squared values. To identify useful interactions and address multicollinearity, we printed the correlation coefficient matrix for all predictors, found the pairs with the highest correlations, and tested them alongside other predictors.

Then, each person on our team tried a different alternative fitting procedure to improve prediction accuracy and model interpretability: best subset selection, stepwise selection, lasso, and ridge regression. We compared the results of each technique to see which improved our prediction accuracy most, using the BIC criterion for best subset and forward/backward stepwise selection. Lasso and ridge did not give us useful information: in our case, regularization did not help us address the multicollinearity among our predictors, and it also yielded a low R-squared, so we discarded these results. The procedure that allowed us to identify useful predictors was forward stepwise selection, as it was the most computationally efficient and yielded the best model fit. We used BIC to select the final model.
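
As a rough sketch of forward stepwise selection guided by BIC, the code below greedily adds the predictor that lowers BIC the most and stops when no addition helps. It is one simple variant under stated assumptions, not necessarily the exact procedure we ran: the data are synthetic, the column names are hypothetical, and BIC is computed for a Gaussian linear model up to an additive constant.

```python
import numpy as np

def bic(y, design):
    """BIC (up to an additive constant) of a least-squares fit of y on design."""
    n, k = design.shape
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ coef) ** 2))
    return n * np.log(rss / n) + k * np.log(n)

def forward_stepwise(X, y, names):
    """Greedily add the column that lowers BIC most; stop when none does."""
    n = len(y)
    intercept = np.ones((n, 1))
    selected = []
    remaining = list(range(X.shape[1]))
    best_bic = bic(y, intercept)  # start from the intercept-only model
    while remaining:
        # Score every candidate model with one more predictor.
        scores = {
            j: bic(y, np.hstack([intercept, X[:, selected + [j]]]))
            for j in remaining
        }
        j_best = min(scores, key=scores.get)
        if scores[j_best] >= best_bic:
            break  # no candidate improves BIC
        best_bic = scores[j_best]
        selected.append(j_best)
        remaining.remove(j_best)
    return [names[j] for j in selected], best_bic

# Synthetic data: only two of the four columns actually drive the response.
rng = np.random.default_rng(0)
n = 150
X = rng.normal(size=(n, 4))
names = ["signal_1", "noise_1", "signal_2", "noise_2"]  # hypothetical names
y = 5.0 + 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.5, size=n)

chosen, score = forward_stepwise(X, y, names)
print(chosen, round(score, 2))
```

Because each step fits only as many models as there are remaining predictors, this search is much cheaper than best subset selection, which is consistent with the computational advantage noted above.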


## Limitations of the model with regard to inference / prediction

If it is inference, will the inference hold for a certain period of time, for a certain subset of population, and / or for certain conditions.

If it is prediction, then will it be possible / convenient / expensive for the stakeholders to collect the data relating to the predictors in the model. Using your model, how soon will the stakeholder be able to predict the outcome before the outcome occurs. For example, if the model predicts the number of bikes people will rent in Evanston on a certain day, then how many days before that day will your model be able to make the prediction. This will depend on how soon the data that your model uses becomes available. If you are predicting election results, how many days / weeks / months / years before the election can you predict the results. 

When will your model become too obsolete to be useful?

## Conclusions and Recommendations to stakeholder(s)

Based on our model, we conclude that median age, education, percentage of females, and depression rate are the most significant predictors of alcohol consumption. Stakeholders in the marketing industry can use this information to target their campaigns towards countries with higher alcohol consumption levels, while those in the legal industry can use it to regulate alcohol consumption in countries with high depression and suicide rates, since these factors also contribute strongly to heavy alcohol use.

We suggest that marketing stakeholders develop targeted campaigns that focus on the identified predictors of alcohol consumption, such as age, education, gender, and depression rates. For instance, campaigns targeting younger people may be more effective in countries with a lower median age, while campaigns that highlight the health risks associated with alcohol consumption may be more effective in countries with higher depression rates. We recommend that stakeholders in the legal industry use the identified predictors to create policies that address the underlying causes of high alcohol consumption, such as depression rates, education levels, and gender disparities. For instance, implementing educational programs that promote responsible alcohol consumption or providing access to mental health services may be effective in reducing alcohol consumption in countries with higher depression rates.

Our model has limitations that stakeholders should be aware of before implementing our recommendations. For instance, it may not predict alcohol consumption levels accurately in all situations, and additional analysis or domain expertise may be required to make our recommendations practically implementable.

Finally, our model can be used in the future to predict alcohol consumption levels, but stakeholders should consider updating it periodically with recent data to maintain its accuracy. The frequency of updates may depend on factors such as changes in the underlying predictors of alcohol consumption or changes in the data used to train the model.

## GitHub and individual contribution {-}

https://github.com/sarahabdul2/STAT-303-2-Project-

Add details of each team member's contribution in the table below.

<html>
<style>
table, td, th {
  border: 1px solid black;
}

table {
  border-collapse: collapse;
  width: 100%;
}

th {
  text-align: left;
}
    

</style>
<body>

<h2>Individual contribution</h2>

<table style="width:100%">
     <colgroup>
       <col span="1" style="width: 15%;">
       <col span="1" style="width: 20%;">
       <col span="1" style="width: 50%;">
       <col span="1" style="width: 15%;"> 
    </colgroup>
  <tr>
    <th>Team member</th>
    <th>Contributed aspects</th>
    <th>Details</th>
    <th>Number of GitHub commits</th>
  </tr>
  <tr>
    <td>Nathan Jung </td>
    <td>Data cleaning/EDA/Regularization/Model Development/Addressed autocorrelation</td>
    <td>Cleaned data to find useful interactions and transformations, performed ridge regression to find useful predictors, contributed to developing the final model, and addressed autocorrelation.</td>
    <td>146</td>
  </tr>
  <tr>
    <td>Christina Tzavara</td>
    <td>Data cleaning/EDA/Variable Selection/Model Development</td>
    <td>Cleaned data, performed transformations, performed Forward/Backward Stepwise, and contributed to final model development.</td>
    <td>101</td>
  </tr>
    <tr>
    <td>Sarah Abdulwahid</td>
    <td>Data cleaning/EDA/Regularization</td>
    <td>Cleaned data to find useful interactions and transformations and performed Lasso regression to improve model fit.</td>
    <td>146</td>    
  </tr>
    <tr>
    <td>Vaynu Kadiyali</td>
    <td>Data cleaning/EDA/Model selection</td>
    <td>Cleaned data to find useful interactions and transformations. Performed best subset selection technique to improve model fit. </td>
    <td>143</td>    
  </tr>
</table>
</body>
</html>

Collaboration on GitHub was quite difficult. Working on the same file simultaneously was particularly troublesome, as we had to manually resolve the many conflicts that arose when merging.

## References {-}

List and number all bibliographical references. When referenced in the text, enclose the citation number in square brackets, for example [1].



## Appendix {-}

You may put additional stuff here as Appendix. You may refer to the Appendix in the main report to support your arguments. However, the appendix section is unlikely to be checked while grading, unless the grader deems it necessary.