# Notes
- Use `pip freeze > requirements.txt` to generate requirements.txt
- Group the data for each survey year with a SQL `SELECT`, then load the result into a dataframe

# Requirements

* Query the dataset using sqlite. Only load the final dataset into a dataframe.

* Give an overview of the respondents of the survey. What is the sample size?
* What are the sociodemographic features of the respondents? Do you see any evidence of sampling bias?
* Perform exploratory data analysis. This should include creating statistical summaries and charts, checking for correlations and other relationships between variables, as well as other EDA elements.
* In a plot, report the prevalence rate of at least three mental diseases. (https://en.wikipedia.org/wiki/Prevalence)
* Make sure to plot the confidence interval and provide its interpretation.
* Your notebook should be readable as a standalone document. In Markdown cells, tell the reader which questions you are trying to answer and interpret your results.
* Provide suggestions about how your analysis can be improved.
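
The prevalence and confidence-interval requirements above can be sketched as follows. This is a minimal illustration, assuming prevalence is treated as a simple binomial proportion and using the Wilson score interval (one common choice; the assignment does not prescribe a method). The condition names and counts are made up for illustration and are not taken from the survey.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Hypothetical (cases, respondents) pairs; replace with counts queried from the database.
example_counts = {
    "condition_a": (120, 400),
    "condition_b": (45, 400),
    "condition_c": (10, 400),
}

for name, (cases, n) in example_counts.items():
    low, high = wilson_interval(cases, n)
    print(f"{name}: prevalence={cases / n:.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

The interval bounds can then be drawn as error bars around each point estimate. A narrower interval means the estimate is more precise given the sample size; it does not correct for sampling bias.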

# Questions to answer (general)

- What are the main types and subtypes of data?
- What are the main metrics of location? What are their main characteristics?
- What is variability? What are the main metrics of variability and their characteristics?
- What is a confidence interval? Why do we need it? Why is it not sufficient to just report the point estimates?
- What is correlation? How do we use it to analyze data?
- What is a contingency table?
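
As a small illustration of the location and variability questions above, the sketch below computes the usual metrics with the standard library. It uses only the nine ages visible in the preview output of the query cell (rows 0 to 4 and 95 to 98), so the numbers describe that handful of rows, not the full sample.

```python
import statistics

# Ages visible in the preview output; this is not the full set of respondents.
ages = [37, 44, 32, 31, 31, 29, 24, 31, 33]

# Metrics of location: where the centre of the data lies.
location = {
    "mean": statistics.mean(ages),      # sensitive to outliers
    "median": statistics.median(ages),  # robust to outliers
    "mode": statistics.mode(ages),      # most frequent value
}

# Metrics of variability: how spread out the data is.
q1, q2, q3 = statistics.quantiles(ages, n=4)
variability = {
    "range": max(ages) - min(ages),
    "sample_std": statistics.stdev(ages),
    "iqr": q3 - q1,
}

print(location)
print(variability)
```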

# Plan of action

- Import the data into a single, coherent dataframe (one whose rows and columns make sense on inspection)
- Review the data
- Clean the data
- Perform exploratory data analysis (the main goal)
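
One way to get a coherent single dataframe is to pivot the long answer table into one row per respondent inside SQL, so that only the final result is loaded. The sketch below uses an in-memory database with a tiny stand-in `Answer` table. The question IDs (1 for age, 2 for gender) and the gender values are assumptions made for illustration, so check them against the real `Question` table before reusing the query.

```python
import sqlite3

# In-memory stand-in for the real database; IDs and gender values are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Answer (AnswerText TEXT, SurveyID INTEGER, UserID INTEGER, QuestionID INTEGER);
INSERT INTO Answer VALUES
    ('37', 2014, 1, 1), ('Female', 2014, 1, 2),
    ('44', 2014, 2, 1), ('Male',   2014, 2, 2);
""")

# Conditional aggregation: one output column per question, one row per respondent.
wide_query = """
SELECT
    SurveyID AS year,
    UserID   AS user_id,
    MAX(CASE WHEN QuestionID = 1 THEN AnswerText END) AS age,
    MAX(CASE WHEN QuestionID = 2 THEN AnswerText END) AS gender
FROM Answer
GROUP BY SurveyID, UserID
ORDER BY SurveyID, UserID
"""

rows = conn.execute(wide_query).fetchall()
conn.close()
print(rows)
```

Against the real database, the same query string could be passed to `pd.read_sql_query` to load the wide result directly.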

In [10]:
import sqlite3
import pandas as pd

# Connect to the SQLite database that holds the survey tables
conn = sqlite3.connect('mental_health.sqlite')

# TODO: double check this query is correct and joins correctly, otherwise all following work is invalid
query = """
SELECT 
    s.SurveyID as year,
    s.Description as survey_description,
    a.UserID as user_id,
    q.QuestionText as question_text,
    a.AnswerText as answer_text
FROM Answer a
JOIN Question q ON a.QuestionID = q.QuestionID
JOIN Survey s ON a.SurveyID = s.SurveyID
"""

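# Load only the joined result into a dataframe; close the connection even if the query fails
try:
    df = pd.read_sql_query(query, conn)
finally:
    conn.close()

# Normalise column names to lower case
df.columns = df.columns.str.lower()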

df.head(100)

| | year | survey_description | user_id | question_text | answer_text |
|---|---|---|---|---|---|
| 0 | 2014 | mental health survey for 2014 | 1 | What is your age? | 37 |
| 1 | 2014 | mental health survey for 2014 | 2 | What is your age? | 44 |
| 2 | 2014 | mental health survey for 2014 | 3 | What is your age? | 32 |
| 3 | 2014 | mental health survey for 2014 | 4 | What is your age? | 31 |
| 4 | 2014 | mental health survey for 2014 | 5 | What is your age? | 31 |
| ... | ... | ... | ... | ... | ... |
| 95 | 2014 | mental health survey for 2014 | 96 | What is your age? | 29 |
| 96 | 2014 | mental health survey for 2014 | 97 | What is your age? | 24 |
| 97 | 2014 | mental health survey for 2014 | 98 | What is your age? | 31 |
| 98 | 2014 | mental health survey for 2014 | 99 | What is your age? | 33 |
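
The TODO in the query cell asks whether the join is correct. A possible sanity check, sketched below, compares the number of rows in `Answer` with the number of rows the inner join returns, and counts answers whose question or survey is missing. The helper name `check_join` and the demo rows are hypothetical. The table and column names follow the query above, and the check assumes `QuestionID` and `SurveyID` are unique in their own tables.

```python
import sqlite3


def check_join(conn):
    """Return (answer_rows, joined_rows, orphan_rows) for the three-table join."""
    answer_rows = conn.execute("SELECT COUNT(*) FROM Answer").fetchone()[0]
    joined_rows = conn.execute("""
        SELECT COUNT(*) FROM Answer a
        JOIN Question q ON a.QuestionID = q.QuestionID
        JOIN Survey s ON a.SurveyID = s.SurveyID
    """).fetchone()[0]
    # Answers that the inner join silently drops because a key has no match.
    orphan_rows = conn.execute("""
        SELECT COUNT(*) FROM Answer a
        LEFT JOIN Question q ON a.QuestionID = q.QuestionID
        LEFT JOIN Survey s ON a.SurveyID = s.SurveyID
        WHERE q.QuestionID IS NULL OR s.SurveyID IS NULL
    """).fetchone()[0]
    return answer_rows, joined_rows, orphan_rows


# Tiny in-memory demo; the third answer points at a question that does not exist.
demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE Survey (SurveyID INTEGER PRIMARY KEY, Description TEXT);
CREATE TABLE Question (QuestionText TEXT, QuestionID INTEGER PRIMARY KEY);
CREATE TABLE Answer (AnswerText TEXT, SurveyID INTEGER, UserID INTEGER, QuestionID INTEGER);
INSERT INTO Survey VALUES (2014, 'mental health survey for 2014');
INSERT INTO Question VALUES ('What is your age?', 1);
INSERT INTO Answer VALUES ('37', 2014, 1, 1), ('44', 2014, 2, 1), ('n/a', 2014, 3, 99);
""")
counts = check_join(demo)
demo.close()
print(counts)
```

If the keys are unique, `joined_rows` should equal `answer_rows - orphan_rows`. A larger joined count would point to duplicate keys in `Question` or `Survey`.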
