
# Stock Tweet and Price Analysis

In this notebook, we will analyze stock prices and tweets related to various companies. We will perform data preprocessing, exploratory data analysis (EDA), and implement time series forecasting models.

## Data Sources

- **Tweets and financial price data** obtained from the Twitter API and Yahoo Finance.
- 10,000 rows in `stocktweet.csv`:
  - **Fields**: IDs, Date (2020), Company ticker, Text of the tweet.
- Stock price CSVs for 38 tickers:
  - **Fields**: Date (2020), Opening price, Highest value, Lowest value, Closing value, Adjusted closing value, Volume of stocks traded on that day.
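
As a rough illustration of how the two sources might be lined up by ticker and date, the sketch below reads both from in-memory CSV text using only the standard library. The column names (`id`, `date`, `ticker`, `tweet`, and the Yahoo-style price headers), the ticker `AAA`, and every value are assumptions made for the example, not the contents of the real files.

```python
import csv
import io
from collections import defaultdict

# Hypothetical samples; the real files, column names, and values may differ.
TWEETS_CSV = """id,date,ticker,tweet
1,2020-01-02,AAA,Strong start to the year
2,2020-01-02,AAA,Not convinced by the rally
3,2020-01-03,BBB,Deliveries beat expectations
"""

PRICES_CSV = """Date,Open,High,Low,Close,Adj Close,Volume
2020-01-02,100.0,102.0,99.0,101.0,101.0,1000
2020-01-03,101.0,103.0,100.0,102.0,102.0,1200
"""


def tweets_by_ticker_date(handle):
    """Group tweet texts under (ticker, date) keys."""
    grouped = defaultdict(list)
    for row in csv.DictReader(handle):
        grouped[(row["ticker"], row["date"])].append(row["tweet"])
    return grouped


def daily_rows(price_handle, ticker, grouped_tweets):
    """Attach the day's tweet count to each price row for one ticker."""
    rows = []
    for row in csv.DictReader(price_handle):
        key = (ticker, row["Date"])
        rows.append({
            "date": row["Date"],
            "close": float(row["Close"]),
            # .get avoids creating empty entries in the defaultdict
            "n_tweets": len(grouped_tweets.get(key, [])),
        })
    return rows


grouped = tweets_by_ticker_date(io.StringIO(TWEETS_CSV))
aaa = daily_rows(io.StringIO(PRICES_CSV), "AAA", grouped)
```

Days with no tweets keep a count of zero rather than being dropped, so the price series stays contiguous for the forecasting step.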

## Time Series Forecasting

- Close price forecasting using tweets and financial price data for 1-day, 3-day, and 7-day periods.
- Create a dynamic dashboard to visualize the forecasts.
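
One way to frame the three horizons as supervised targets is to pair each day's close with the close `h` rows later. The sketch below assumes horizons are counted in trading days (rows of the price file) rather than calendar days; the function name and the sample values are hypothetical.

```python
def make_horizon_targets(closes, horizons=(1, 3, 7)):
    """Pair each day index with the close price h trading days later.

    Returns {h: [(t, close[t], close[t + h]), ...]}. Days that have no
    future value at horizon h are dropped.
    """
    targets = {}
    for h in horizons:
        targets[h] = [
            (t, closes[t], closes[t + h])
            for t in range(len(closes) - h)
        ]
    return targets


closes = [100.0 + i for i in range(10)]  # illustrative values only
targets = make_horizon_targets(closes)
```

Longer horizons leave fewer usable rows, which matters when the series only covers a single year.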

## Project Requirements

1. **Distributed Data Processing**:
   - Utilize a distributed data processing environment like Spark for some part of the analysis.

2. **Data Storage and Processing**:
   - Store the source datasets in SQL/NoSQL databases before processing them with MapReduce or Spark (HBase, Hive, Spark SQL, Cassandra, MongoDB).
   - Write to NoSQL with Hadoop or Spark.
   - After MapReduce processing, store the output in an appropriate NoSQL database.

3. **Data Extraction**:
   - Store the processed data and perform a follow-up analysis on the output.
   - Extract data from the NoSQL database into another format, such as CSV, to import into Python.

4. **Database Performance Analysis**:
   - Perform a comparative analysis of two databases (e.g., MySQL, MongoDB, Cassandra, HBase, CouchDB).
   - Record appropriate metrics and conduct a quantitative comparison of the two chosen database systems.

5. **Sentiment Extraction**:
   - Provide evidence and justification of the sentiment extraction techniques used.

6. **Time Series Forecasting Methods**:
   - Explore at least two methods of time series forecasting, including:
     - One Neural Network model.
     - One Autoregressive model (e.g., ARIMA, SARIMA).
   - Handle short time series data appropriately.

7. **Analysis and Forecasting**:
   - Justify the choices made for the final analysis.
   - Include forecasts for 1-day, 3-day, and 7-day periods.

8. **Dynamic Dashboard**:
   - Design an interactive dashboard to display the forecasts.
   - Include design rationale expressing Tufte's principles.
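
For the sentiment extraction requirement (item 5), a lexicon-based score is one of the simplest techniques to evidence and justify. The sketch below is illustrative only: the six-word `LEXICON` is a toy stand-in invented for the example, and a real analysis would more likely justify an established lexicon or a trained model.

```python
import re

# Toy lexicon for illustration only (an assumption, not a real resource).
LEXICON = {"beat": 1, "strong": 1, "gain": 1, "miss": -1, "weak": -1, "loss": -1}


def tweet_score(text):
    """Return a score in [-1, 1]: mean polarity of the lexicon words found."""
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = [LEXICON[t] for t in tokens if t in LEXICON]
    return sum(hits) / len(hits) if hits else 0.0


def daily_sentiment(tweets):
    """Average tweet scores for one ticker on one day (0.0 if no tweets)."""
    scores = [tweet_score(t) for t in tweets]
    return sum(scores) / len(scores) if scores else 0.0
```

Averaging per ticker and day yields one sentiment value per price row, which can then be used as an extra input to the forecasting models.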

## Big Data Processing

- Prepare and process data using MapReduce/Spark environments.
- Conduct a comparative analysis for two databases (SQL and NoSQL) using YCSB.
- Provide rationale and justification for data processing, storage, and programming language choices.
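
To make the database comparison quantitative, the raw timings from each benchmark run need to be reduced to a few comparable numbers. The sketch below is not YCSB and does not call it; it only shows, under the assumption that per-operation latencies in milliseconds and the run's wall-clock time are available, how throughput and tail latencies might be tabulated. The function name and the nearest-rank percentile choice are assumptions.

```python
def summarize_latencies(latencies_ms, wall_seconds):
    """Summarize per-operation latencies from one benchmark run."""
    ordered = sorted(latencies_ms)
    n = len(ordered)
    if n == 0 or wall_seconds <= 0:
        raise ValueError("need at least one sample and a positive duration")

    def percentile(p):
        # Nearest-rank percentile for integer p: the smallest sample with
        # at least p% of all samples at or below it.
        rank = max(1, (p * n + 99) // 100)
        return ordered[rank - 1]

    return {
        "ops": n,
        "throughput_ops_per_s": n / wall_seconds,
        "mean_ms": sum(ordered) / n,
        "p95_ms": percentile(95),
        "p99_ms": percentile(99),
    }
```

Running the same workload against both databases and placing the two summaries side by side gives the basis for the comparison table in the report.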

## Architecture Design

- Design the architecture for processing big data using necessary technologies (Hadoop/Spark, NoSQL/SQL databases, and programming).
- Present the design in the form of a diagram and discussion in the report.

**Note**: MapReduce-style processing in this context includes platforms such as Apache Spark.

## Advanced Data Analytics

- **Rationale, Evaluation, and Justification**:
  - Justify the choices made in terms of EDA, data wrangling, machine learning models, and algorithms implemented.
- **Hyperparameter Tuning**:
  - Evaluate and justify the hyperparameter tuning techniques used in the models.
- **Analysis of Forecast**:
  - Analyze the data and perform 1-day, 3-day, and 7-day forecasts of the close price for 5 companies using tweets and financial price data.
- **Presentation of Results**:
  - Use figures, captions, tables, and the dynamic dashboard to present the forecast results effectively.
- **Tufte's Principles**:
  - Discuss how Tufte's principles are applied in the design of the dashboard.
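
Evaluating and comparing the models needs a common set of error measures per horizon. The sketch below computes MAE, RMSE, and MAPE for paired actual and predicted closes; the choice of these three metrics and the function name are assumptions, since the brief does not prescribe any.

```python
import math


def forecast_errors(actual, predicted):
    """Return MAE, RMSE, and MAPE (%) for paired actual/predicted closes."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    diffs = [p - a for a, p in zip(actual, predicted)]
    n = len(diffs)
    mae = sum(abs(d) for d in diffs) / n
    rmse = math.sqrt(sum(d * d for d in diffs) / n)
    # MAPE divides by the actual value, so it assumes no actual close is zero.
    mape = 100 * sum(abs(d) / abs(a) for d, a in zip(diffs, actual)) / n
    return {"mae": mae, "rmse": rmse, "mape": mape}
```

Reporting the same metrics for the 1-day, 3-day, and 7-day forecasts of each model makes the results table directly comparable.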


## Time Series Forecast

The project involves making a time series forecast of the CLOSE price for at least 5 companies using both the tweet data and financial price data. Forecasts are made for 1 day, 3 days, and 7 days into the future and displayed on a dynamic dashboard.

### Project Requirements and Elements

- **Distributed Data Processing:** 
  - The project incorporates a distributed data processing environment like Spark for part of the analysis.

- **Data Storage in SQL/NoSQL Databases:** 
  - Source datasets are stored in SQL/NoSQL databases prior to processing using MapReduce or Spark (HBase, Hive, Spark SQL, Cassandra, MongoDB).
  - Data is loaded into the NoSQL database using an appropriate tool (Hadoop or Spark).

- **Post-MapReduce Processing:**
  - After MapReduce processing, the datasets are stored in an appropriate NoSQL database.
  - The processed data is then extracted from the NoSQL database into another format (e.g., CSV) for further analysis in Python.

- **Comparative Analysis of Databases:** 
  - A test strategy is devised to perform a comparative analysis of the capabilities of two databases (e.g., MySQL, MongoDB, Cassandra, HBase, CouchDB).
  - Metrics are recorded, and a quantitative analysis is performed to compare the performance of the chosen database systems.

- **Sentiment Extraction Techniques:** 
  - Evidence and justification are provided for the sentiment extraction techniques used in the analysis.

- **Time-Series Forecasting Methods:** 
  - At least two methods of time-series forecasting are explored, including:
    - **1 Neural Network Model:** (e.g., LSTM)
    - **1 Autoregressive Model:** (e.g., ARIMA, SARIMA)
  - Because the series is short (a single year of daily data), the forecasting approach must account for the limited history.

- **Final Analysis and Justification:** 
  - Justifications for the choices made in the final analysis are provided, along with the forecasts for 1 day, 3 days, and 7 days going forward.

- **Dynamic and Interactive Dashboard:** 
  - The dashboard must be dynamic and interactive.
  - The design rationale must express Tufte's principles.
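
For the autoregressive requirement and the short-series caveat above, one option is expanding-window (walk-forward) evaluation, which refits the model at every forecast origin so that no observation is wasted on a fixed hold-out split. The sketch below uses a plain AR(1) fitted by least squares as a stand-in; it is not ARIMA or SARIMA from a library, and the function names and `min_train` default are assumptions.

```python
def fit_ar1(series):
    """Least-squares fit of y[t] = c + phi * y[t-1]; returns (c, phi)."""
    x, y = series[:-1], series[1:]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    if var_x == 0:
        return mean_y, 0.0  # flat history: fall back to the mean
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    phi = cov / var_x
    return mean_y - phi * mean_x, phi


def forecast_ar1(series, horizon):
    """Iterate the fitted recursion `horizon` steps past the last value."""
    c, phi = fit_ar1(series)
    value = series[-1]
    for _ in range(horizon):
        value = c + phi * value
    return value


def walk_forward(series, horizon, min_train=20):
    """Refit on series[:end] at each origin and forecast `horizon` steps ahead.

    Returns a list of (target_index, prediction, actual) tuples.
    """
    results = []
    for end in range(min_train, len(series) - horizon + 1):
        target = end - 1 + horizon
        results.append((target, forecast_ar1(series[:end], horizon), series[target]))
    return results
```

Calling `walk_forward(closes, horizon=h)` for `h` in 1, 3, and 7 gives out-of-sample predictions for each horizon that can be scored and compared against the neural network model.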