# Your Final Project Title by *Your Name

## 1. Problem Statement, Motivation, Research Goals: 

This should be precise and right to the point.

## 2. Data Source and Description: 

Where/how do you get the data?

## 3. Literature Review and References: 

This could include noting any key papers, texts, or websites
that you have used to develop your modeling approach, as well as what others have
done on this problem in the past. You must properly credit sources.

## 4. Preliminary EDA: 

    a) What is the shape of your data set - how many rows (observations), and how many columns (variables/features)?
    b) What are the names of these variables, and what are their full descriptions?
    c) How many numerical variables?
    d) How many categorical variables?
    e) How many text variables?
    f) What variables other than the above are involved?
    g) What methods have you used to preprocess/clean/explore the data and why? 

You will provide related visualizations, summary statistics, and verbal descriptions.
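
As a starting point, questions (a) through (g) can be answered with a few lines of pandas. The block below is only a sketch on a made-up toy table: the column names and values are hypothetical, and treating every column that is neither numeric nor categorical as text is a simplifying assumption you should check against your own data.

```python
import pandas as pd

# Hypothetical toy data standing in for your real data set.
df = pd.DataFrame({
    "price": [250000.0, 410000.0, 385000.0, None],
    "bedrooms": [2, 4, 3, 3],
    "city": pd.Categorical(["Boston", "Cambridge", "Boston", "Quincy"]),
    "description": ["Sunny condo", "Large family home", "Renovated duplex", "Near transit"],
})

# (a) Shape: number of rows (observations) and columns (variables).
n_rows, n_cols = df.shape

# (b) Variable names and their storage types.
variable_types = df.dtypes

# (c), (d) Numerical and categorical variables.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(include="category").columns.tolist()

# (e) Assumption: whatever is left over is treated as free text here.
text_cols = [c for c in df.columns if c not in numeric_cols + categorical_cols]

# (g) One basic cleaning check: missing values per column.
missing_per_column = df.isna().sum()

# Summary statistics for the numerical variables.
summary = df[numeric_cols].describe()

print(n_rows, n_cols, numeric_cols, categorical_cols, text_cols)
```

On real data you would replace the toy table with your own loading step and add the visualizations that fit your variables.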

## 5. Modeling Process (the main portion): 

Provide a reproducible modeling process (with all code and comments, from data to models) that fits:

     a) an initial baseline (simple) model for comparison 
     b) a set of competitive feasible models
     c) a final model of your best choice
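
One possible shape for steps (a) through (c) is sketched below with scikit-learn. It is an illustration under stated assumptions, not a required recipe: the data are synthetic (`make_classification`), the candidate models are arbitrary choices, and selecting by mean 5-fold cross-validated accuracy is only one reasonable criterion.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for your own features X and labels y.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

BASELINE = "baseline (most frequent class)"
models = {
    # (a) Simple baseline: always predict the most frequent class.
    BASELINE: DummyClassifier(strategy="most_frequent"),
    # (b) A set of competitive, feasible models.
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Compare all models with the same cross-validation protocol.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}

# (c) Pick the best-scoring model and refit it on all available data.
best_name = max(scores, key=scores.get)
final_model = models[best_name].fit(X, y)

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
print("selected:", best_name)
```

In a real project you would also keep a held-out test set that is never used for model selection, so the final reported performance is not optimistic.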

## 6. Project Progress, Timeline, and Achievement: 

Briefly report your progress against a clear timeline (in weeks of the semester), and summarize step by step what you achieved from beginning to end.

This timeline shows how you arrived at your results and illustrates the effort you put in along the way.
How well does your model and/or implementation perform? Did you meet your goals? What is the significance of your results?

## 7. Conclusions and Possible Future Work: 

Summarize the strengths and weaknesses of your results, speculate on how you might address the shortcomings, and outline further research directions you would pursue given more time.


# Example 1 - Movie classification using plot descriptions by *Your Name

## 1. Problem Statement, Motivation, Research Goals: 

In this project I build a machine learning pipeline from scratch to classify movies
using their plot descriptions. I collect the dataset by scraping, which is common
practice in data science and machine learning projects; the data sources I can scrape
from are described below. Starting from the standard bag-of-words representation, I use EDA
to identify an appropriate input representation, perform feature engineering on it, and
determine a good way to represent movie descriptions. The choice of model depends strongly
on the choice of input representation.

Concrete goals:
1. Scrape a dataset of about 1000 movies.
2. Try bag-of-words and word2vec representations for movie descriptions.
3. Try a naive Bayes classifier and an SVM classifier for both of these text features.
4. Answer interesting questions about movie genres and their plots.
5. Use conventional machine learning algorithms (not deep nets) to learn a classifier which
takes in text as a bag-of-words representation.
6. Explore other text representations like word2vec [2] and GloVe [3] word embeddings to
see how the right representation affects the performance. This is an especially important
lesson for non-deep models.
7. Take control of the whole machine learning pipeline, and create my own
personal version of a project like [1].
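
The bag-of-words goals above can be sketched with scikit-learn as follows. This is a minimal illustration, not the project's actual pipeline: the four plot snippets and their labels are invented, each movie is given a single genre (real movies often have several), and `CountVectorizer`, `MultinomialNB`, and `LinearSVC` are one reasonable choice of components among many.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus; a real run would use the scraped plot descriptions.
plots = [
    "a detective hunts a killer through the city",
    "the killer leaves clues for the detective",
    "two friends fall in love during a summer wedding",
    "a wedding planner finds love with a friend",
]
genres = ["thriller", "thriller", "romance", "romance"]

# Each pipeline turns raw text into word counts, then fits a classifier.
nb_model = make_pipeline(CountVectorizer(), MultinomialNB())
svm_model = make_pipeline(CountVectorizer(), LinearSVC())

nb_model.fit(plots, genres)
svm_model.fit(plots, genres)

# Classify an unseen description with both models.
new_plot = ["a detective follows clues to find the killer"]
nb_pred = nb_model.predict(new_plot)[0]
svm_pred = svm_model.predict(new_plot)[0]
print(nb_pred, svm_pred)
```

Swapping `CountVectorizer` for an embedding-based feature extractor is the natural place to compare representations while keeping the classifiers fixed.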

## 2. Data Source and Description: 

1. Data sources for labels (movie genres):
- TMDB: A free, open-source dataset of movie information (https://www.themoviedb.org/?language=en).
You will need to create an account to obtain an API key to download information.
You can use the library ‘tmdbsimple’ for making easy API calls.
- IMDB: The standard database for movie information. You can use the library ‘imdb’
to get data from IMDB. Once you know how to get information from TMDB, you can
get information about the same movie from IMDB. This makes it possible
to combine more information and get a richer dataset. Due to the differences
between the two datasets, you will have to do some cleaning; however, both
datasets are extremely clean, so the cleaning will be minimal.
2. Data sources for movie reviews:
- TMDB, IMDB (as above)
- Wikipedia: Most movies have a wiki page which contains a "plot" section. You can
scrape movie description data from here.
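
To make the TMDB step concrete, here is a hedged sketch of turning one TMDB movie record into a row of the dataset. The helper names `tmdb_info_to_row` and `fetch_tmdb_row` are hypothetical, the sample record is invented, and the field names (`id`, `title`, `overview`, `genres`) reflect my understanding of the TMDB movie details response, so verify them against a real response. The fetch function uses `tmdbsimple` and needs network access plus your own API key; it is defined but not called here.

```python
def tmdb_info_to_row(info):
    """Flatten one TMDB movie record into the fields this project needs."""
    return {
        "tmdb_id": info["id"],
        "title": info["title"],
        # Genres arrive as a list of {"id": ..., "name": ...} dictionaries.
        "genres": [g["name"] for g in info.get("genres", [])],
        "overview": (info.get("overview") or "").strip(),
    }


def fetch_tmdb_row(movie_id, api_key):
    """Fetch one movie from TMDB (requires network access and a valid key)."""
    import tmdbsimple as tmdb  # imported lazily so the rest runs without it

    tmdb.API_KEY = api_key
    return tmdb_info_to_row(tmdb.Movies(movie_id).info())


# Invented record used only to demonstrate the flattening step offline.
sample_info = {
    "id": 1,
    "title": "Example Movie",
    "overview": "  A made-up plot summary used for illustration.  ",
    "genres": [{"id": 28, "name": "Action"}, {"id": 878, "name": "Science Fiction"}],
}

row = tmdb_info_to_row(sample_info)
print(row)
```

Keeping the flattening step separate from the network call makes it easy to test the cleaning logic without hitting the API.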

## 3. Literature Review and References: 

[1] Spandan Madan. Spandan-Madan/DeepLearningProject: First release of the Deep Learning
Project, July 2017.

[2] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations
of words and phrases and their compositionality. In Advances in neural information
processing systems, pages 3111–3119, 2013.

[3] Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for
word representation. In Proceedings of the 2014 conference on empirical methods in natural language
processing (EMNLP), pages 1532–1543, 2014.

## 4. Preliminary EDA: 

    a) What is the shape of your data set - how many rows (observations), and how many columns (variables/features)?
    b) What are the names of these variables, and what are their full descriptions?
    c) How many numerical variables?
    d) How many categorical variables?
    e) How many text variables?
    f) What variables other than the above are involved?
    g) What methods have you used to preprocess/clean/explore the data and why? 

You will provide related visualizations, summary statistics, and verbal descriptions.

## 5. Modeling Process (the main portion): 

Provide a reproducible modeling process (with all code and comments, from data to models) that fits:

     a) an initial baseline (simple) model for comparison 
     b) ...


# Example 2 - Home Value Predictions in Greater Boston Area by *Your Name

## 1. Problem Statement, Motivation, Research Goals: 

A home is often the largest and most expensive purchase a person makes in his or her lifetime.
Ensuring homeowners have a trusted way to monitor this asset is incredibly important. While
many individual homebuyers are less sensitive to price forecasts, small errors in price prediction
can have systemic negative effects in the economy as a whole. Accurate prediction makes it easier
to understand which features would influence the final property price.

Real estate prices depend heavily on factors that are not easy to control. Analyzing
broader market conditions and specific property determinants in order to establish how property
values may change over time is critically important. Massive amounts of data can be obtained
about the current market situation, which calls for powerful machine learning algorithms
that can predict with high precision and in a reasonable time frame.

Goal: Build a home valuation algorithm from the ground up, using both internal and external
data sources. Compare final predictions against the estimated values of other popular algorithms.

High-level Project Goals
    
    Propose a method that efficiently handles all the factors in residential real estate. Train/test
    your model on 2016/2017 data and report your results.
        
    Include the interior information about the property into your model (property images). Does
    including images improve the predictive power of your model?
    
    Include exterior information (Google images). How does including Google images affect
    your predictions?
    
    Your final milestone will be predicting the prices based on January 2018 data. Use the above
    mentioned steps and derive the best possible combination (features/models) to compare
    your predictions with real prices. How precisely will you be able to predict days on the
    market and price for every unit?
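
One reading of "train/test on 2016/2017 data" is a time-based split: fit on 2016 listings and evaluate on 2017 listings, so the model is never scored on data older than what it was trained on. The sketch below assumes that reading. The column names (`list_date`, `sqft`, `sale_price`) and all values are hypothetical, and the price-per-square-foot rule is only a naive baseline to compare real models against.

```python
import pandas as pd

# Hypothetical listings; real data would come from the provided dataset.
listings = pd.DataFrame({
    "list_date": pd.to_datetime(
        ["2016-03-01", "2016-07-15", "2016-11-20", "2017-02-10", "2017-08-05"]
    ),
    "sqft": [900, 1500, 2100, 1200, 1800],
    "sale_price": [300000, 450000, 600000, 380000, 540000],
})

# Time-based split: train on 2016, test on 2017.
train = listings[listings["list_date"].dt.year == 2016]
test = listings[listings["list_date"].dt.year == 2017]

# Naive baseline: one overall price per square foot learned from 2016.
price_per_sqft = train["sale_price"].sum() / train["sqft"].sum()
predictions = test["sqft"] * price_per_sqft

# Mean absolute error of the baseline on the 2017 listings.
mae = (predictions - test["sale_price"]).abs().mean()
print(f"price per sqft: {price_per_sqft:.2f}, MAE: {mae:.2f}")
```

The same split can be reused when adding interior or exterior image features, so that any change in error is attributable to the features rather than to a different evaluation set.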

## 2. Data Source and Description: 

In this project I use a list of real estate properties in the Greater Boston
Area. The dataset covers 2016 and 2017 properties and contains a large
number of features: information about the property, historical data, taxes, exclusive brokers, etc.
Additionally, you will be provided with a dataset of listings from 1/1/2018 through
1/31/2018 (approximately 23000 properties) to test your model at the end of the tracking period,
which will align with the end of your final project.

Additional data challenges

I will scrape the historical values of the properties in the 2016/2017 data
and compare predictions against both estimated values and actual sale prices at the time.

In the real estate industry, pictures readily show what a house looks like. I will use the information from the provided dataset (street name, street number,
and zip code) to scrape property descriptions and pictures (some listings have walkthrough
videos that could also be taken into account).

From house pictures, people can easily get an overall feel for the house: what
the construction style is, what the neighboring environment looks like, etc. Use the Google Street
View image API to submit an address and scrape Google images of the building (exterior
properties, street quality, neighborhood, nearby attractions).
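
As a rough illustration of the Street View step, the sketch below builds a request URL from the three address fields mentioned above. It assumes the Street View Static API endpoint and its `size`, `location`, and `key` parameters; the helper name `street_view_url`, the example address, and the placeholder key are all hypothetical. No request is sent, so the block runs without network access.

```python
from urllib.parse import urlencode

# Assumed endpoint of the Street View Static API.
STREET_VIEW_ENDPOINT = "https://maps.googleapis.com/maps/api/streetview"


def street_view_url(street_number, street_name, zip_code, api_key, size="640x640"):
    """Build an image request URL for one property address."""
    location = f"{street_number} {street_name}, {zip_code}"
    params = {"size": size, "location": location, "key": api_key}
    # urlencode escapes spaces and commas so the address is URL-safe.
    return f"{STREET_VIEW_ENDPOINT}?{urlencode(params)}"


# Hypothetical address and placeholder key, for illustration only.
url = street_view_url(1, "Main St", "02139", api_key="YOUR_API_KEY")
print(url)
```

Downloading the image is then a single HTTP GET on the returned URL; check the API's usage limits and terms before scraping at scale.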

## 3. Literature Review and References: 

1. Zillow Kaggle Competition: https://www.kaggle.com/c/zillow-prize-1
2. Using Python to scrape Google Street images: https://andrewpwheeler.wordpress.com/2015/12/28/using-python-to-grab-google-street-view-imagery/
3. A. V. Dorogush, A. Gulin, G. Gusev, N. Kazeev, L. Ostroumova Prokhorenkova, A. Vorobev,
Fighting biases with dynamic boosting, CoRR, 2017.
4. T. Chen and C. Guestrin, XGBoost: A Scalable Tree Boosting System, In Proceedings of the
22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
(KDD ’16). ACM, New York, NY, USA, 785-794, 2016.
5. Q. You, R. Pang, L. Cao and J. Luo, Image-Based Appraisal of Real Estate Properties, in IEEE
Transactions on Multimedia, vol. 19, no. 12, pp. 2751-2759, 2017.

## 4. Preliminary EDA: 

    a) What is the shape of your data set - how many rows (observations), and how many columns (variables/features)?
    b) What are the names of these variables, and what are their full descriptions?
    c) How many numerical variables?
    d) How many categorical variables?
    e) How many text variables?
    f) What variables other than the above are involved?
    g) What methods have you used to preprocess/clean/explore the data and why? 

You will provide related visualizations, summary statistics, and verbal descriptions.

## 5. Modeling Process (the main portion): 

Provide a reproducible modeling process (with all code and comments, from data to models) that fits:

     a) an initial baseline (simple) model for comparison 
     b) ...


# Example 3 - Cryptocurrency Market Analysis and Prediction by *Your Name

## 1. Problem Statement, Motivation, Research Goals: 

Cryptocurrencies such as Bitcoin and Ethereum generated significant attention in 2017. Cryptocurrencies
are highly volatile because there is rampant speculation. Given the high variance in
prices, can data science methods be used to model the market dynamics?

Can an effective trading strategy be found? Try exploring different strategies.
We are looking for a demonstration of sound data science principles
here.

- Market analysis: Given that there is now option trading on certain cryptocurrencies, is it possible to
create a volatility index for cryptocurrencies, similar to the VIX? Is the variance of this market infinite
and therefore not predictable? Are there any rational reasons for investing that you can
justify using data science?
- Arbitrage: Given the number of different currencies and different markets, how efficient is the
market? Are there arbitrage opportunities? Can evidence of arbitrage be found?
- Your own: Find some other area that your team would like to explore. This could include using CNNs
to analyze up-to-the-minute market plots to find interesting patterns, analysis of the ICO
market, etc. Discuss with your TF before choosing a new direction. You will need to justify
access to data, complexity, and expected hypotheses.
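
A first step toward the volatility question is to measure realized (historical) volatility from daily closing prices. The sketch below is only that: the VIX is derived from option prices (implied volatility), so this is a simpler, backward-looking stand-in. The prices are made up, the 3-day window is arbitrary, and annualizing with 365 days assumes a market that trades every day.

```python
import numpy as np
import pandas as pd

# Hypothetical daily closing prices.
close = pd.Series(
    [100.0, 102.0, 101.0, 105.0, 103.0, 108.0, 107.0, 110.0], name="close"
)

# Daily log returns: log(P_t / P_{t-1}).
log_returns = np.log(close / close.shift(1)).dropna()

# Rolling sample standard deviation of returns, annualized.
window = 3
realized_vol = log_returns.rolling(window).std() * np.sqrt(365)

print(realized_vol.dropna().round(3))
```

Plotting this series for several window lengths is a quick way to see how unstable the variance estimate is before committing to a forecasting model.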

Project goal
Apply well-reasoned data science methods to analyze a specific aspect of the cryptocurrency space.
There are many examples of simple attempts to throw deep learning models at cryptocurrency
historical prices. The vast majority lack scientific rigor and analysis of variance. We are looking
for you to go beyond these feeble attempts and take a rigorous approach to determine whether this is a
solvable problem and to justify your hypotheses.

High-level project goals
1. Analyze the trading data using traditional and novel market analysis metrics.
2. Build a series of models, ideally including a deep learning model such as an RNN, to predict a
specific aspect such as pricing, volatility, or market volume.
3. Choose the best model, justify your choice, and describe strengths and limitations of the
chosen model.


## 2. Data Source and Description: 

### a) Kaggle Cryptocurrency Historical Prices
www.kaggle.com/sudalairajkumar/cryptocurrencypricehistory
- Historical daily pricing data of the most common cryptocurrencies (Bitcoin, Ethereum,
Ripple, etc.). Includes open, high, low, close, volume, market cap.
- Data up to November 2017
- Can integrate other sources, such as Blockchain Info or Etherscan

### b) Bitcoin Trades to the minute
www.kaggle.com/mczielinski/bitcoin-historical-data
- Minute by minute trading data on Bitcoin from various exchanges.
- Data from January 2012 until January 2018.
- Useful for trading strategies and perhaps arbitrage, but not recommended for long
term market dynamics.
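
Because minute-level data is noisy for longer-horizon questions, it is often aggregated to daily bars first. The sketch below shows one way to do that with pandas `resample`. The column names `price` and `volume` and all values are hypothetical and will not match the Kaggle file's actual headers.

```python
import pandas as pd

# Hypothetical minute-level trades indexed by timestamp.
minutes = pd.DataFrame(
    {
        "price": [100.0, 101.0, 99.0, 102.0, 103.0, 101.5],
        "volume": [1.0, 0.5, 2.0, 1.5, 1.0, 0.5],
    },
    index=pd.to_datetime(
        [
            "2017-01-01 00:00",
            "2017-01-01 00:01",
            "2017-01-01 23:59",
            "2017-01-02 00:00",
            "2017-01-02 12:30",
            "2017-01-02 23:59",
        ]
    ),
)

# Daily open/high/low/close from the minute prices.
daily = minutes["price"].resample("D").ohlc()

# Daily traded volume is the sum over all minutes in the day.
daily["volume"] = minutes["volume"].resample("D").sum()

print(daily)
```

The resulting daily table has the same open, high, low, close, and volume layout as the daily historical price data, which makes the two sources easier to compare.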

### c) Bitcoin Blockchain in BigQuery
cloud.google.com/blog/big-data/2018/02/bitcoin-in-bigquery-blockchain-analytics-on-publicdata
- All historical bitcoin transactions in an easy to query, very fast format using Google Cloud

## 3. Literature Review and References: 
1. Financial forecasting with probabilistic programming and Pyro
2. Predicting Cryptocurrency Prices With Deep Learning
3. Learning to trade Cryptocurrencies with Reinforcement Learning

## 4. Preliminary EDA: 

    a) What is the shape of your data set - how many rows (observations), and how many columns (variables/features)?
    b) What are the names of these variables, and what are their full descriptions?
    c) How many numerical variables?
    d) How many categorical variables?
    e) How many text variables?
    f) What variables other than the above are involved?
    g) What methods have you used to preprocess/clean/explore the data and why? 

You will provide related visualizations, summary statistics, and verbal descriptions.

## 5. Modeling Process (the main portion): 

Provide a reproducible modeling process (with all code and comments, from data to models) that fits:

     a) an initial baseline (simple) model for comparison 
     b) ...



# Example 4 - Predicting Success on Third Downs in NFL Football by *Your Name

## 1. Problem Statement, Motivation, Research Goals: 

     Third down and long is the toughest situation for any offensive coordinator in the NFL. –
     Matthew Stafford

Success in the NFL often comes down to consistent achievement in key but common situations.
One of the most important tactical elements in NFL football is being able to convert a third down
into a first down (or better). Several analytic studies have explored strategic considerations on
fourth downs, but we are not aware of studies that have addressed success on third downs. In
most third down plays (but not all), a team’s offense tries to achieve a first-down conversion; if
they fail, their options on fourth down are limited and much of the time the team will punt the ball.

The goal of this project is to obtain play-by-play NFL data to examine, predict, and summarize
success on third downs. Situational features of the game are certain to play an important role: the
number of yards short of the first-down marker, where the offense is on the field, the current score of
the game, the time remaining in the game, the identity of key players such as the quarterback and running
back, and so on. What types of plays are run, and what are their success probabilities? What advice
can you give coaches based on your conclusions?
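
A simple first summary of third-down success is the conversion rate by distance and by play type. The sketch below assumes a play-by-play table with hypothetical columns `down`, `yards_to_go`, `yards_gained`, and `play_type`; the rows are invented. It counts a play as converted when the yards gained cover the distance, which ignores penalties and other special cases.

```python
import pandas as pd

# Hypothetical play-by-play rows; real data would come from the scraped games.
plays = pd.DataFrame({
    "down": [3, 3, 3, 3, 1, 3, 3, 2],
    "yards_to_go": [2, 8, 12, 1, 10, 5, 15, 7],
    "yards_gained": [3, 5, 15, 0, 4, 6, 9, 2],
    "play_type": ["run", "pass", "pass", "run", "run", "pass", "pass", "run"],
})

# Keep only third-down plays.
third = plays[plays["down"] == 3].copy()

# Simplified definition of success: gained at least the yards needed.
third["converted"] = third["yards_gained"] >= third["yards_to_go"]

# Bucket the distance into short, medium, and long situations.
third["distance"] = pd.cut(
    third["yards_to_go"],
    bins=[0, 3, 7, 100],
    labels=["short (1-3)", "medium (4-7)", "long (8+)"],
)

rate_by_distance = third.groupby("distance", observed=True)["converted"].mean()
rate_by_play_type = third.groupby("play_type")["converted"].mean()

print(rate_by_distance)
print(rate_by_play_type)
```

These grouped rates also serve as a natural baseline for any later model that predicts conversion probability from situational features.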

Key Challenges: Scraping data, wrangling data, visualization, statistical modeling, machine learning.

## 2. Data Source and Description: 

You will be required to collect play-by-play data for NFL games. Some resources for play-by-play
data include

- https://www.pro-football-reference.com/boxscores/201709100was.htm#all_pbp

- http://www.espn.com/nfl/playbyplay?gameId=400951760

- https://www.kaggle.com/maxhorowitz/nflplaybyplay2009to2016/data

It may help to use the nflscrapR R package. Details of the package may be found at
https://github.com/maksimhorowitz/nflscrapR. A short example of its use can be found at
https://www.r-bloggers.com/nfl-series/.

## 3. Literature Review and References: 
 
1. Berri, D. J., & Burke, B. (2012). Measuring productivity of NFL players. In The economics of
the National Football League (pp. 137-158). Springer New York.
2. Fokoue, E., & Foehrenbach, D. (2013). A Statistical Data Mining Approach to Determining
the Factors that Distinguish Championship Caliber Teams in the National Football League.
3. Onwuegbuzie, A. J. (2000). Is Defense or Offense More Important for Professional Football
Teams? A Replication Study Using Data from the 1998-1999 Regular Football Season.
Perceptual and motor skills, 90(2), 640-648.
4. Clevenson, M. L., & Wright, J. (2009). Go For It: What to consider when making fourth-down
decisions. Chance, 22(1), 34-41.

## 4. Preliminary EDA: 

    a) What is the shape of your data set - how many rows (observations), and how many columns (variables/features)?
    b) What are the names of these variables, and what are their full descriptions?
    c) How many numerical variables?
    d) How many categorical variables?
    e) How many text variables?
    f) What variables other than the above are involved?
    g) What methods have you used to preprocess/clean/explore the data and why? 

You will provide related visualizations, summary statistics, and verbal descriptions.

## 5. Modeling Process (the main portion): 

Provide a reproducible modeling process (with all code and comments, from data to models) that fits:

     a) an initial baseline (simple) model for comparison 
     b) ...