## Project 3: Diagnostic Classification System
- 01 Web Scraping and Data Cleaning

#### Content
> * [Background](#Background)
> * [Problem Statement](#Problem-Statement) 
> * [Web Scraping and Data Cleaning](#Web-Scraping-And-Data-Cleaning)
> * [Exploratory Data Analysis](#Exploratory-Data-Analysis)
> * [Preprocessing and Modeling](#Preprocessing-And-Modeling)
> * [Baseline Model Performance Evaluation](#Baseline-Model-Performance-Evaluation)
> * [Evaluation and Conceptual Understanding](#Evaluation-And-Understanding)
> * [Conclusion and Recommendations](#Conclusion-And-Recommendations)

---
<a id='Background'></a>
### Background

Research has shown that gifted children can exhibit some of the same behaviours as those on the autism spectrum. Social quirkiness is normal in bright kids as well as in those with autism. Like kids on the spectrum, gifted kids also have keen memories and a good grip on language. They can also get lost in their imaginations or think logically and critically to the extent that imagination seems far away. Both groups can also find it difficult to manage social interactions with their peers. 

At the same time, these are broad generalisations of behaviours. Once you dive deeper, you can see there are some salient differences. For example, a gifted child may present an extensive and advanced vocabulary with a rich verbal style. A child on the autism spectrum may have an advanced use of vocabulary, but they may not have full comprehension of the language they use. Their verbal style may also be less inviting and do less to engage others. By extension, these differences can cause students with Asperger's to have learning styles and needs that differ from those of gifted students. 

Studies have shown that it may be possible to distinguish children on the autism spectrum from gifted children by examining their use of language. This forms the basis of Project 3. This project will address convergent and divergent aspects in communication & language between individuals with autism and gifted individuals.

Citations: 
- Tai, J., & Goy, P. (2021). Study: 1 in 150 Children in Singapore Has Autism. The Straits Times.
- Chen, L., Abrams, D. A., Rosenberg-Lee, M., Iuculano, T., Wakeman, H. N., Prathap, S., Chen, T., & Menon, V. (2019). Quantitative analysis of heterogeneity in academic achievement of children with autism. Clinical psychological science : a journal of the Association for Psychological Science, 7(2), 362–380. 
- Lim, P. (2018). "Specific Language Impairment in Children with High-Functioning Autism Spectrum Disorder." Inquiries Journal, 10(05).
- Aggarwal, R., Ringold, S., Khanna, D., Neogi, T., Johnson, S. R., Miller, A., Brunner, H. I., Ogawa, R., Felson, D., Ogdie, A., Aletaha, D., & Feldman, B. M. (2015). Distinctions between diagnostic and classification criteria?. Arthritis care & research, 67(7), 891–897. 
- Minshew, N. J., Goldstein, G., & Siegel, D. J. (1995). Speech and language in high-functioning autistic individuals. Neuropsychology, 9(2), 255–261.

---
<a id='Problem-Statement'></a> 
### Problem Statement

Context: The provision of healthcare in Singapore has become more challenging for several reasons.
1. Shifts in the nature of diseases highlight the system's shortfall in managing complex chronic diseases
2. Evolution of healthcare consumer expectations
3. Manpower shortage in public hospitals to service the burgeoning aged population
4. Poor design of systems and operational inefficiencies lead to significant waste in healthcare

Healthcare waste is incurred any time a patient, doctor, or healthcare worker engages in unnecessary medical activity - ranging from preventable mistakes in medical care, to misdiagnoses, provision of unnecessary treatments, and procedural inconsistencies. Research has shown that up to 20% of all healthcare resource expenditures are quality-associated waste and this can amount to a staggering sum. 

In Singapore, part of the costs (and risks) are borne by individuals and families, while part of the costs (and risks) are accounted for by the State - borne by taxpayers, and/or private health insurers. To eliminate waste, the Singapore government has come to incorporate technology into various care models to overcome the various cost- and quality-based challenges in the Healthcare sector. Most of these technologies are procured from private healthtech companies and start-ups. 

Citations: 
- Ooi, Low & Koh, Gerald & Tan, Lawrence & Yap, Jason & Chew, Samuel & Jih, Chin & Fung, Daniel & Sing, Lee & Lee, Patricia & Boon, Lim & Lim, Ruth & Low, James & Sachdev, Ravinder & Seah, Daren & Yeng, Siaw & Chiu, Tan & Teo, David & Tiwari, Satyaprakash & Tym, Wong & Scott, Richard. (2015). National Telemedicine Guidelines of Singapore.
- Nakhooda, F. (2021). The Bottom Line (Healthcare): Cutting Healthcare Waste: A Win-Win for Providers, Payers, Patients. The Business Times, Opinion & Features. 
- Khalik, S. (2018). Experts Highlight Prevalence and Cost of Waste in Healthcare Expenditure. The Straits Times. 

---
### Task
- What other diagnostic criteria can we extract using NLP models to diagnose autistic and gifted children accurately?

You work in the Research and Development (R&D) team of a healthtech startup in Singapore. The company has been enlisted by the Ministry of Health Holdings (MOHH) to create a simple diagnostic tool to rule out specific conditions and diseases. After the development of a differential diagnosis, this tool will be a core feature in the series of additional tests that will be conducted by healthcare professionals to rule out either autism or giftedness. Healthcare professionals will be able to come to a final diagnosis that is more accurate and precise, reducing the likelihood of misdiagnosis produced by the existing slew of subjective tests.

---
<a id='Web-Scraping-And-Data-Cleaning'></a>
### Web Scraping and Data Cleaning

Load Packages.

In [1]:
import pandas as pd
from pandas import DataFrame
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from nltk.corpus import stopwords

In [2]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

Scrape data from two subreddits: r/aspergers and r/Gifted. Note that the former has a total of 135k members while the latter has only 17.1k members. 

In [3]:
# install PRAW from the terminal in the base environment: pip install praw

In [4]:
# import module for scraping
import praw

In [5]:
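# read the API credentials from environment variables instead of hard-coding them
# (the variable names below are assumptions; any names can be used)
import os
reddit = praw.Reddit(client_id=os.environ['REDDIT_CLIENT_ID'],
                     client_secret=os.environ['REDDIT_CLIENT_SECRET'],
                     user_agent='WebScraper3')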

Get 10 hottest posts from the individual subreddits.

In [6]:
# get 10 hot posts from the aspergers subreddit
hot_posts = reddit.subreddit('aspergers').hot(limit=10)
for post in hot_posts:
    print(post.title)

RequestException: error with request HTTPSConnectionPool(host='www.reddit.com', port=443): Max retries exceeded with url: /api/v1/access_token (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x1653464f0>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known'))

In [None]:
# get 10 hot posts from the Gifted subreddit
hot_posts = reddit.subreddit('Gifted').hot(limit=10)
for post in hot_posts:
    print(post.title)

Obtain general information about the subreddit using the `description` attribute of the subreddit object. 

In [None]:
# Get aspergers subreddit data 
aspergers_sub = reddit.subreddit('aspergers')

print(aspergers_sub.description)

In [None]:
# Get Gifted subreddit data
gifted_sub = reddit.subreddit('Gifted')

print(gifted_sub.description)

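The listing returned by `aspergers_sub.hot()` can be iterated over, and features including the post title, id and url can be extracted into a dataframe, which is saved to a .csv file later in this notebook. Note that Reddit listings generally return at most about 1,000 items, so `limit=1750` may yield fewer posts than requested.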

In [None]:
aspergers_df = []
for post in aspergers_sub.hot(limit=1750):
    aspergers_df.append([post.title, post.score, post.id, post.subreddit, post.url, 
                 post.num_comments, post.selftext, post.created])
aspergers_df = pd.DataFrame(aspergers_df, columns = ['title', 'score', 'id', 'subreddit', 'url',
                                       'num_comments', 'body', 'created'])
print(aspergers_df.tail(10))

The listing returned by `gifted_sub.hot()` can be iterated over in the same way, and features including the post title, id and url can be extracted into a dataframe, which is saved to a .csv file later in this notebook.

In [None]:
gifted_df = []
for post in gifted_sub.hot(limit=1750):
    gifted_df.append([post.title, post.score, post.id, post.subreddit, post.url, 
                 post.num_comments, post.selftext, post.created])
gifted_df = pd.DataFrame(gifted_df, columns = ['title', 'score', 'id', 'subreddit', 'url',
                                       'num_comments', 'body', 'created'])
print(gifted_df.tail(10))
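
The two scraping cells above differ only in the subreddit they read from, so the shared logic could be factored into one helper. The following is a minimal sketch, assuming a hypothetical function name `posts_to_frame` that is not part of the original notebook. It accepts any iterable of objects exposing the same attributes as PRAW submissions, so it can be tried offline with stand-in objects.

```python
from types import SimpleNamespace

import pandas as pd

# Column names used by the notebook; 'body' holds `selftext`.
COLUMNS = ['title', 'score', 'id', 'subreddit', 'url',
           'num_comments', 'body', 'created']


def posts_to_frame(posts):
    """Collect submission-like objects into a dataframe (hypothetical helper)."""
    rows = [
        # str() stores the subreddit name rather than the PRAW object itself
        [post.title, post.score, post.id, str(post.subreddit), post.url,
         post.num_comments, post.selftext, post.created]
        for post in posts
    ]
    return pd.DataFrame(rows, columns=COLUMNS)


# Stand-in post for a quick offline check (all values are made up).
fake_post = SimpleNamespace(title='Example title', score=3, id='abc123',
                            subreddit='aspergers', url='https://example.com',
                            num_comments=2, selftext='Example body',
                            created=1600000000.0)
demo_df = posts_to_frame([fake_post])
```

With a live `reddit` client, `posts_to_frame(aspergers_sub.hot(limit=1750))` would be expected to produce the same columns as the cells above.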

Create a new column for top-level comments in both dataframes (optional).

In [None]:
# create a new column 'comment' for aspergers_df
# aspergers_df['comment'] = ' '

In [None]:
# create a new column 'comment' for gifted_df
# gifted_df['comment'] = ' '

Obtain the comments for a post/submission by creating a `submission` object and looping through its `comments` attribute (optional).

In [None]:
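# create a 'submission' object for each post and collect its top-level comments
# for i in range(len(aspergers_df)): 
    # submission = reddit.submission(url=aspergers_df['url'][i])
    # drop `MoreComments` placeholders, which have no `body` attribute
    # submission.comments.replace_more(limit=0)
    # to get the top-level comments, iterate over `submission.comments`
    # for top_lvl_comment in submission.comments:
        # use .loc to avoid chained assignment
        # aspergers_df.loc[i, 'comment'] += top_lvl_comment.body
# print(aspergers_df.head(3))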

In [None]:
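# create a 'submission' object for each post and collect its top-level comments
# for i in range(len(gifted_df)): 
    # submission = reddit.submission(url=gifted_df['url'][i])
    # drop `MoreComments` placeholders, which have no `body` attribute
    # submission.comments.replace_more(limit=0)
    # to get the top-level comments, iterate over `submission.comments`
    # for top_lvl_comment in submission.comments:
        # use .loc to avoid chained assignment
        # gifted_df.loc[i, 'comment'] += top_lvl_comment.body
# print(gifted_df.head(3))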
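
If the optional comment step is enabled, note that a PRAW comment forest can contain `MoreComments` placeholders, which have no `body` attribute. The following is a minimal sketch of a hypothetical helper, `join_comment_bodies`, that skips such entries. It is an illustration rather than the notebook's method, and the stand-in objects below are made up so the idea can be checked offline.

```python
from types import SimpleNamespace


def join_comment_bodies(comments, sep=' '):
    """Join the text of comment-like objects, skipping any without a `body`."""
    return sep.join(c.body for c in comments if hasattr(c, 'body'))


# Offline stand-ins: two comments and one placeholder without a `body`.
fake_comments = [SimpleNamespace(body='first'),
                 SimpleNamespace(),
                 SimpleNamespace(body='second')]
joined = join_comment_bodies(fake_comments)
```

With PRAW, calling `submission.comments.replace_more(limit=0)` before iterating is the usual way to drop the placeholders at the source.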

Remove unnecessary columns from each dataframe. 

In [None]:
# list out all the columns in the dataframe
aspergers_df.columns

In [None]:
# remove unnecessary columns from aspergers_df
aspergers_df.drop(columns=['score', 'id', 'subreddit', 'url', 'num_comments', 'created'], inplace=True) 

In [None]:
aspergers_df.columns

In [None]:
# list out all the columns in the dataframe
gifted_df.columns

In [None]:
# remove unnecessary columns from gifted_df
gifted_df.drop(columns=['score', 'id', 'subreddit', 'url', 'num_comments', 'created'], inplace=True)

In [None]:
# list out all the columns in the dataframe
gifted_df.columns

Drop NAs and reset the index. 

In [None]:
# check to see if there are empty cells in `aspergers_df`
# no NAs to drop
aspergers_df.isnull().sum()

Comment: There are no null values in `aspergers_df` to remove. 

In [None]:
# check to see if there are empty cells in `gifted_df`
gifted_df.isnull().sum()

Comment: There are no null values in `gifted_df` to remove.
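
`isnull()` only counts missing values. Link posts come back from PRAW with an empty string as `selftext`, which `isnull()` does not flag. The following is a minimal sketch of an extra check, assuming a hypothetical helper `count_blank` and a toy dataframe that stands in for the scraped data.

```python
import pandas as pd


def count_blank(frame, columns):
    """Count cells per column that are missing, empty, or whitespace-only."""
    return pd.Series({
        col: int(frame[col].fillna('').str.strip().eq('').sum())
        for col in columns
    })


# Toy frame for illustration only; not scraped data.
toy = pd.DataFrame({'title': ['a', 'b', 'c'],
                    'body': ['some text', '', '   ']})
blank_counts = count_blank(toy, ['title', 'body'])
```

Applied to the real data, `count_blank(aspergers_df, ['title', 'body'])` would show how many posts contribute only a title to the text feature.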

Sort the data into X and y columns.


1. Sort the data from `aspergers_df` into `text_feature` and `diagnosis`.

In [None]:
# create a new column X for text_features
aspergers_df['text_feature'] = aspergers_df[['title', 'body']].apply(" ".join, axis=1)

In [None]:
# engineer a feature 'diagnosis', a 1/0 binary column, where '1' indicates 'aspergers'
aspergers_df['diagnosis'] = 1

In [None]:
aspergers_df.head(3)

In [None]:
# drop redundant columns
aspergers_df.drop(columns=["title", "body"], inplace=True)

In [None]:
# save aspergers_df to csv
aspergers_df.to_csv('../data/aspergers_df.csv', index=False)

2. Sort the data from `gifted_df` into `text_feature` and `diagnosis`.

In [None]:
# create a new column X for text_features
gifted_df['text_feature'] = gifted_df[['title', 'body']].apply(" ".join, axis=1)

In [None]:
# engineer a feature 'diagnosis', a 1/0 binary column, where '0' indicates 'gifted'
gifted_df['diagnosis'] = 0

In [None]:
gifted_df.head(3)

In [None]:
# drop the redundant columns
gifted_df.drop(columns=["title", "body"], inplace=True)

In [None]:
# save gifted_df to csv
gifted_df.to_csv('../data/gifted_df.csv', index=False)
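
The feature and label steps are identical for both dataframes apart from the label value. The following is a minimal sketch of how they could be combined, assuming a hypothetical helper `build_labelled_frame`. It uses `Series.str.cat` in place of the row-wise `apply`, which joins the same two columns with a space.

```python
import pandas as pd


def build_labelled_frame(frame, label):
    """Return a new frame with `text_feature` (title + body) and a constant label."""
    return pd.DataFrame({
        'text_feature': frame['title'].str.cat(frame['body'], sep=' '),
        'diagnosis': label,
    })


# Toy posts for illustration only; the second has an empty body.
toy_posts = pd.DataFrame({'title': ['Hello', 'Question'],
                          'body': ['world', '']})
labelled = build_labelled_frame(toy_posts, label=1)
```

Because the helper returns a new dataframe, the original `title` and `body` columns remain available if they are needed for inspection later.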

Concatenate the data into one single dataframe and save `df` as csv for preprocessing and modeling. 

In [None]:
# concatenate the data into one single dataframe along rows
df = pd.concat([aspergers_df, gifted_df], axis=0)

In [None]:
df.diagnosis.value_counts()
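
Raw counts show how many posts each class contributes. Passing `normalize=True` gives the class shares instead, and the share of the larger class is the accuracy of a model that always predicts that class. A minimal sketch on made-up labels:

```python
import pandas as pd

# Made-up labels for illustration: three 'aspergers' posts and one 'gifted' post.
toy = pd.DataFrame({'diagnosis': [1, 1, 1, 0]})

shares = toy['diagnosis'].value_counts(normalize=True)
majority_share = shares.max()
```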

In [None]:
# reset the index after appending the dataframes together by rows
df.reset_index(inplace=True, drop=True)

In [None]:
df.tail(3)

In [40]:
# save df.csv for preprocessing and modeling
df.to_csv('../data/df.csv', index=False)
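
Reddit post bodies often contain commas and line breaks, so it can be worth confirming that the saved file reads back unchanged. The following is a minimal sketch of such a round-trip check on a toy dataframe, written to a temporary directory rather than the notebook's `../data/` folder.

```python
import os
import tempfile

import pandas as pd

# Toy data for illustration; the first text contains a comma and a line break.
toy = pd.DataFrame({'text_feature': ['line one, with comma\nline two', 'plain'],
                    'diagnosis': [1, 0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'df.csv')
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# True when every text value survived the write and read unchanged.
texts_match = reloaded['text_feature'].tolist() == toy['text_feature'].tolist()
```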