<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

#  Capstone Project: Train Delays Predictor
---

### Problem Statement and Solution Approach:

**Problem:**<br>
Despite ongoing efforts to improve the MRT system, train delays and faults persist, causing frustration and inconvenience for passengers.

**Proposed Solution:**<br>As a daily commuter on the Singapore MRT, I aim to develop a train delays predictor that identifies the stations and times most likely to experience breakdowns or delays.

By analyzing historical data on time of day, type of day, station name, commuter volume, and breakdown/non-breakdown indicators, I hope to create a model that can accurately predict future breakdowns and help commuters avoid stations and times with potential delays.
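As a rough illustration of the "time of day" and "type of day" inputs, the sketch below derives both from a tweet timestamp. It assumes timestamps arrive as UTC strings such as `2023-03-08T12:42:24.000Z` (the format the scraper output in this notebook shows) and that "type of day" is a simple weekday/weekend split. The helper name `time_features` is hypothetical, and the actual features may be defined differently in the analysis notebook.

```python
from datetime import datetime, timedelta, timezone

# Singapore Standard Time is UTC+8 all year round (no daylight saving)
SGT = timezone(timedelta(hours=8))


def time_features(timestamp):
    """Derive the local hour and a weekday/weekend flag from a UTC timestamp string."""
    utc_time = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)
    local_time = utc_time.astimezone(SGT)
    # Assumption: "type of day" means weekday vs weekend (public holidays ignored)
    day_type = "weekend" if local_time.weekday() >= 5 else "weekday"
    return {"hour": local_time.hour, "day_type": day_type}


print(time_features("2023-03-08T12:42:24.000Z"))
```

Converting to local time first matters: a tweet posted at 18:00 UTC on a Friday is already 02:00 on Saturday in Singapore.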

## 1. Data Collection: Web scraping

### Context:
Scrape tweets from the SMRT Twitter profile into two columns: Timestamp and Tweet.

In [52]:
# Import Dependencies

import selenium
import random
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import NoSuchElementException
from time import sleep
import pandas as pd

**Set up the scraper using selenium**

**Set up some conditions (for example, I only want to scrape data from the period 2018 to 2023)**
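The year condition can be sketched as a small helper that decides, per tweet, whether to keep it, skip it, or stop scrolling. This is only an illustration under the assumption that the profile lists tweets newest first. The function name `classify_tweet` and its return labels are hypothetical and are not part of the scraper below.

```python
def classify_tweet(timestamp, start_year=2018, end_year=2023):
    """Return 'keep', 'skip' or 'stop' for an ISO timestamp such as '2023-03-14T09:30:00.000Z'."""
    year = int(timestamp[:4])  # the first four characters are the year
    if year > end_year:
        return "skip"  # newer than the target range, keep scrolling
    if year < start_year:
        return "stop"  # older than the target range, nothing more to collect
    return "keep"


for ts in ["2023-03-14T09:30:00.000Z", "2017-12-31T23:59:59.000Z"]:
    print(ts, classify_tweet(ts))
```

One caveat with any "stop at the first old tweet" rule: a pinned tweet can be older than the tweets that follow it, so it may be safer to stop only after several old tweets in a row.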

**Start scraping!**

In [79]:
# Set up the webdriver
driver = webdriver.Chrome()

# Navigate to the Twitter page you want to scrape
url = "https://twitter.com/SMRT_Singapore"
driver.get(url)

# Wait for the page to load
sleep(5)

# Set the year range to keep. The loop below also stops once it reaches a tweet
# from start_year, so in practice this covers roughly 2018 to 2023.
start_year = 2017
end_year = 2023

# Find the articles on the page
articles = driver.find_elements(By.XPATH,"//article[@data-testid='tweet']")

# Create an empty dictionary to store the timestamps and tweets
tweets_dict = {}

# Loop through the articles on the page
while True:
    for article in articles:
        # Get the timestamp and tweet text; skip articles that lack either
        # element (for example, a tweet with no text body)
        try:
            timestamp = article.find_element(By.XPATH,".//time").get_attribute('datetime')
            tweet = article.find_element(By.XPATH,".//div[@data-testid='tweetText']")
        except NoSuchElementException:
            continue

        # Parse the year from the timestamp
        year = int(timestamp[:4])
        
        # If the year is within the desired range, add the timestamp and tweet to the dictionary.
        # Articles still on screen from the previous pass are read again, so the same
        # text can be appended more than once.
        if start_year <= year <= end_year:
            tweets_dict.setdefault(timestamp, []).append(tweet.text)
    
    # Scroll to the bottom of the page
    # driver.execute_script('window.scrollTo(0,document.body.scrollHeight);')

    # Scroll half a page
    driver.execute_script("window.scrollBy(0, window.innerHeight/2);")


    # Wait for the page to load new data for 3 to 5 seconds
    sleep(random.uniform(3, 5))
    
    # Get the new list of articles
    new_articles = driver.find_elements(By.XPATH,"//article[@data-testid='tweet']")
    
    # If there are no new articles, assume we have reached the end of the page and exit the loop
    if not new_articles:
        break
    
    # Update the articles list
    articles = new_articles
    
    # Stop once the most recently added timestamp has reached start_year
    # (skip the check if nothing has been collected yet)
    if tweets_dict:
        last_year = int(list(tweets_dict.keys())[-1][:4])
        if last_year <= start_year:
            break

# Print the number of timestamps and tweets collected
print(f"Number of timestamps collected: {len(tweets_dict)}")
print(f"Number of tweets collected: {sum([len(v) for v in tweets_dict.values()])}")

# Quit the webdriver
driver.quit()

Number of timestamps collected: 735
Number of tweets collected: 8359
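The tweet count is far larger than the timestamp count, most likely because each half-page scroll re-reads articles that are still on screen and appends their text again. A minimal sketch of one way to avoid this is shown below: add a tweet only if it is not already stored, and stop after a few passes that yield nothing new. The names `merge_batch`, `collect` and `max_idle_passes` are hypothetical, and the batches are simulated lists rather than live Selenium elements.

```python
def merge_batch(store, batch):
    """Add (timestamp, text) pairs to store, skipping pairs already present.

    Returns the number of pairs that were new.
    """
    added = 0
    for timestamp, text in batch:
        texts = store.setdefault(timestamp, [])
        if text not in texts:
            texts.append(text)
            added += 1
    return added


def collect(batches, max_idle_passes=3):
    """Merge batches in order and stop after max_idle_passes passes in a row with nothing new."""
    store = {}
    idle = 0
    for batch in batches:
        if merge_batch(store, batch) == 0:
            idle += 1
            if idle >= max_idle_passes:
                break
        else:
            idle = 0
    return store


# Simulated scroll passes: overlapping content, then three passes with nothing new
passes = [
    [("t1", "a"), ("t2", "b")],
    [("t2", "b"), ("t3", "c")],
    [("t3", "c")],
    [("t3", "c")],
    [("t3", "c")],
    [("t4", "d")],  # never reached, because the loop has already stopped
]
result = collect(passes)
print(result)
```

In the scraper, each batch would be the `(timestamp, tweet.text)` pairs read in one pass of the `for article in articles` loop.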


**After scraping, convert the scraped data from the dictionary into a DataFrame.**

In [81]:
# create a list of (timestamp, tweets) tuples from the tweets_dict dictionary
tweets_list = list(tweets_dict.items())

# create a DataFrame from the list of tuples
df = pd.DataFrame(tweets_list, columns=['Timestamp', 'Tweet'])

# print the DataFrame
print(df.head())

                  Timestamp                                              Tweet
0  2023-03-14T09:30:00.000Z  [Have you shared your views yet?#ICYMI: We're ...
1  2023-03-09T01:43:20.000Z  [[BPLRT] Train services are running normally.,...
2  2023-03-08T13:00:03.000Z  [[BPLRT] UPDATE: Additional 10mins train trave...
3  2023-03-08T12:51:47.000Z  [[BPLRT] UPDATE: Additional 15mins train trave...
4  2023-03-08T12:42:24.000Z  [[BPLRT] CLEARED: Train services has resumed. ...
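The service tweets in this sample follow a visible pattern: a line code in square brackets, an optional status keyword such as `UPDATE` or `CLEARED`, then the message. The sketch below pulls those parts out with a regular expression. It assumes the pattern holds for other lines as well, which the sample alone does not confirm, and the helper name `parse_alert` is hypothetical.

```python
import re

# [LINE] optional STATUS: message
ALERT_PATTERN = re.compile(
    r"^\[(?P<line>[A-Z0-9]+)\]\s*(?:(?P<status>[A-Z]+):\s*)?(?P<message>.*)$",
    re.DOTALL,
)


def parse_alert(text):
    """Return a dict with line, status and message, or None if the tweet is not a service alert."""
    match = ALERT_PATTERN.match(text.strip())
    if match is None:
        return None
    return match.groupdict()


print(parse_alert("[BPLRT] CLEARED: Train services has resumed."))
print(parse_alert("[BPLRT] Train services are running normally."))
print(parse_alert("Have you shared your views yet?"))
```

Tweets that return `None` here, such as general announcements, can be filtered out before building breakdown indicators.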


**Export scraped data to Excel**

In [1]:
# Run this cell in the same kernel session as the cells above;
# in a fresh session `df` is undefined and the call raises a NameError
df.to_excel("C:/Users/qiyua/Desktop/QIYUAN/CAREER/TFIP/TFIP_General Assembly/Course materials/CAPSTONE PROJECT/excel/tweets_live8.xlsx", index=False)

**We will continue the rest of the analysis in a separate workbook. Please refer to "2. Data Cleaning and Data Analysis" for the analysis and recommendations.**