# Collecting Data

The foundation of your project is collecting appropriate data for the problem you are trying to solve. Machine learning models are remarkably versatile: you might predict the next social media trend, categorize cancer cells, study the relationship between temperature fluctuations and coffee bean growth rates, or build a chatbot.

Since we are very early in this course and most likely have not yet discussed many models, it may be difficult to know what you can do with a dataset or what is in scope for this course. In keeping with the course title, "Exploring Machine Learning", we will take an exploratory approach to your project.

The goal of this part of the project is to explore which datasets you might be interested in. The questions below will help guide you toward a category of data that you want to explore further.

## Identifying what data you want to explore

Data is everywhere, and there seems to be data on just about anything. You might know exactly what you want to dive deeper into, or you might have no idea. Either way, I invite you to answer the questions below.

Below, create a Python dictionary in which each key is a short summary of a topic of interest and each value is an explanation of your interest, such as why you are interested in the topic or why you feel a strong passion to understand it. A topic of interest could be research you are conducting, a topic you are studying at your job, a hobby, or a topic surrounding your identities.

List 5 topics, and give each topic a description of at least 50 words.

For example, I might put:

```python
interests = {
    "Cats" : "I have two cats at home, they are basically my children. I would generally like to learn more about cat behavior, health trends and pet owner behavior. It may be interesting to also see industry trends among cat owners, or how they compare to dog owners. Maybe later I want to start to write an app that recognizes cat breeds.",

    "Scuba" : "The study of scuba diving seems to be a 'soft' science, and there are general guidelines on when and how long you should do safety stops to avoid getting decompression sickness. Could there be links to human anatomy or behavior on how deep a person should go safely during a dive?",
}
```
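
The two requirements above (5 topics, at least 50 words each) can be checked before submitting. The sketch below is only an illustration: `check_interests` is a hypothetical helper, not part of the assignment or its grader, and it assumes that "words" means whitespace-separated tokens.

```python
def check_interests(interests, min_topics=5, min_words=50):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    if len(interests) < min_topics:
        problems.append(
            f"Only {len(interests)} topic(s) listed; expected at least {min_topics}."
        )
    for topic, description in interests.items():
        # Assumption: a "word" is any whitespace-separated token.
        word_count = len(description.split())
        if word_count < min_words:
            problems.append(
                f"'{topic}' has {word_count} word(s); expected at least {min_words}."
            )
    return problems


# Made-up input, for illustration only.
sample = {"Cats": "I have two cats at home and want to learn about their behavior."}
for problem in check_interests(sample):
    print(problem)
```

Calling `check_interests(what_are_topics_you_are_interested_in())` on your own answer should return an empty list once both requirements are met.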

In [None]:
# Do not edit the name of this function, it will be used for grading
def what_are_topics_you_are_interested_in():
    # Adjacent string literals inside parentheses are joined into one string,
    # so a long description can span lines without picking up indentation.
    interests = {
        "Air Travel": (
            "I love traveling, especially by airplane, being able to explore so many different "
            "kinds of airports and airlines. I am interested in seeing the potential correlation between various "
            "factors such as time of day, airline, and more as well as how they impact flight delays. The idea is "
            "to identify flights and airports with a high risk of delay so that consumers can plan their travel accordingly."
        ),
        "COVID-19": (
            "I have worked on bioinformatics/biostatistics projects in the past, and it may seem "
            "simple enough but I want to understand how socioeconomic factors like a country's GDP, population, "
            "education level, and more impacted the spread of COVID in different countries. I could then build a "
            "potential model to detect disease spread based on various factors and assign them appropriate weights."
        ),
        "Election Prediction": (
            "I really enjoy reading about and learning more about political science "
            "(in my free time) and especially since this is an election year, I want to design a potential model "
            "that could help predict the Presidential election result. I realize this is a daunting task but I will "
            "start slow and ask questions like, \"How can we predict a voter's registration based on factors like "
            "age, region, education level, etc?\" and then go on to more complex questions like, \"How can we predict "
            "a state's electoral vote outcome?\""
        ),
        "Cancer Detection": (
            "Again, since a lot of my prior research has focused on bioinformatics and "
            "I have even done a project on breast cancer genomics analysis when in high school, I would like to "
            "extend this work and develop a machine learning model to predict whether an individual has or will "
            "develop cancer based on genomic data."
        ),
        "Stock Market": (
            "This is a tricky one but I want to try to create a model that predicts stock price "
            "based on real data and previous trends. I can utilize the yfinance (Yahoo Finance) library "
            "to get historical data and use it to first 'predict' previous prices. If the model is able to "
            "accurately identify previous prices, I will have it try to predict future prices. This will be "
            "useful for amateur investors like me to help them take calculated risks."
        ),
        "Housing Market": (
            "Within the past few years, housing prices have skyrocketed, even in smaller "
            "cities. Thus, I want to develop a machine learning model to help predict the estimated "
            "price of a house based on factors like location, square footage, number of rooms, amenities, "
            "and more which would help a first-time home owner identify their ideal budget."
        ),
    } # Fill out your interests
    return interests
# Note: you can use the \ symbol to continue your string to the next line, but the
# next line's leading spaces then become part of the string. Placing adjacent string
# literals inside parentheses, as above, joins them without the extra spaces.
# Example:
print("This is an "
      "extended string ")
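
To see what a trailing `\` inside a string literal actually produces, the short sketch below compares it with adjacent string literals placed inside parentheses. The variable names are made up for illustration. The last step shows one possible way to clean up a string that was already written with backslashes, assuming that runs of whitespace carry no meaning in your descriptions.

```python
# Backslash continuation inside a string literal: the indentation of the
# following line ends up inside the string.
with_backslash = "This is an \
      extended string"

# Adjacent string literals inside parentheses are concatenated into one
# string, and the indentation between them is not part of the result.
joined = (
    "This is an "
    "extended string"
)

print(repr(with_backslash))
print(repr(joined))

# Collapse every run of whitespace into a single space.
normalized = " ".join(with_backslash.split())
print(normalized == joined)
```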

## Do datasets exist for my interests?

There is a lot of data out there, but not for everything. Below are some websites where you can look at available datasets. Go ahead and search for datasets related to your topics. Are there many datasets surrounding your topic? Are there many different types of data, such as categorical, numerical, or image data? If datasets are limited, do you feel comfortable with the challenge of creating your own data? (Note: creating your own dataset to supplement existing datasets will increase your score on this assignment.)

> You can find a link to databases on the course page!

For 3 of your topics, find 3 datasets you might want to use for your project. Below, create a dictionary whose keys are the topic names you listed above and whose values are lists of links to the 3 datasets you would like to explore. If you would also like to create your own data, add the string "Create my own data" to the end of the list.

Note: if you have trouble finding datasets for your topic, you can make your topic more general or try a different topic. For example, I could expand my "Cats" topic to "Pets", "Pet Toy Sales" or "Pet Health Benefits".

You can always change your topic and dataset later, so don't feel that these decisions are permanent.

Did your search generate any ideas about interests or datasets you would like to explore? If so, you can add a topic to the dictionary above or replace one!


Example:

```python
datasets = {
    "Cats" : ["https://www.kaggle.com/datasets/ma7555/cat-breeds-dataset", "https://example.com", "https://example.com", "Create my own data"],

    "Second Topic": ["https://example.com", "https://example.com", "https://example.com"],
    
    "Third Topic" : ["https://example.com", "https://example.com", "https://example.com"]
}
```
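
A dictionary shaped like the example above can be sanity-checked in a few lines. This is a minimal sketch, not the grader: `check_datasets` and `OWN_DATA` are hypothetical names, and the check only tests whether each entry looks like a web link, not whether the page exists.

```python
from urllib.parse import urlparse

OWN_DATA = "Create my own data"  # marker string from the instructions above


def check_datasets(datasets, interests=None, min_links=3):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    for topic, entries in datasets.items():
        # Optionally confirm that the topic matches a key in the interests dictionary.
        if interests is not None and topic not in interests:
            problems.append(f"'{topic}' is not a key in the interests dictionary.")
        links = [entry for entry in entries if entry != OWN_DATA]
        if len(links) < min_links:
            problems.append(
                f"'{topic}' has {len(links)} link(s); expected at least {min_links}."
            )
        for link in links:
            parsed = urlparse(link)
            if parsed.scheme not in ("http", "https") or not parsed.netloc:
                problems.append(f"'{link}' does not look like a web link.")
    return problems


# Made-up input, for illustration only.
sample = {
    "Cats": [
        "https://www.kaggle.com/datasets/ma7555/cat-breeds-dataset",
        "https://example.com",
        "not a link",
        OWN_DATA,
    ],
}
for problem in check_datasets(sample, interests={"Cats": "..."}):
    print(problem)
```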


In [7]:
def find_some_datasets():
    # One link per line; "Create my own data" marks topics I may supplement myself.
    datasets = {
        "Air Travel": [
            "https://www.kaggle.com/datasets/patrickzel/flight-delay-and-cancellation-dataset-2019-2023",
            "https://www.kaggle.com/datasets/undersc0re/flight-delay-and-causes",
            "https://www.kaggle.com/datasets/jimschacko/airlines-dataset-to-predict-a-delay",
        ],
        "COVID-19": [
            "https://health.google.com/covid-19/open-data/raw-data",
            "https://datacatalog.worldbank.org/search/collections/covid_19",
            "https://www.kaggle.com/datasets/imdevskp/corona-virus-report",
        ],
        "Election Prediction": [
            "https://www.census.gov/topics/public-sector/voting.html",
            "https://www.kaggle.com/datasets/tunguz/us-elections-dataset",
            "https://www.ncsbe.gov/results-data/voter-registration-data",
            "Create my own data",
        ],
        "Cancer Detection": [
            "https://www.kaggle.com/datasets/crawford/gene-expression/data",
            "https://www.kaggle.com/datasets/raghadalharbi/breast-cancer-gene-expression-profiles-metabric",
            "https://www.kaggle.com/datasets/erdemtaha/cancer-data",
        ],
        "Stock Market": [
            "https://www.kaggle.com/datasets/borismarjanovic/price-volume-data-for-all-us-stocks-etfs",
            "https://www.kaggle.com/datasets/jacksoncrow/stock-market-dataset",
            "https://www.kaggle.com/datasets/aaron7sun/stocknews",
            "Create my own data",
        ],
        "Housing Market": [
            "https://www.kaggle.com/datasets/vikrishnan/boston-house-prices",
            "https://www.kaggle.com/datasets/fedesoriano/the-boston-houseprice-data",
            "https://www.kaggle.com/datasets/schirmerchad/bostonhoustingmlnd",
            "Create my own data",
        ],
    }
    return datasets

## Asking questions about your dataset

Some questions you might want to ask for each dataset are:
- Who created this dataset?
- When was this dataset created?
- Could any biases have been introduced when this dataset was created?
- How was this data collected?
- Is this data representative of the problem I am trying to solve?
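
Questions about who created a dataset, and when and how, are usually answered by its documentation, but representativeness can be partly checked in code. The sketch below is one possible starting point using only the standard library. The CSV contents, the column names `airline` and `delayed`, and the helper `summarize_column` are all made up for illustration. With a real dataset you would open the downloaded file instead of the in-memory sample.

```python
import csv
import io
from collections import Counter

# Made-up rows standing in for a real file.
SAMPLE_CSV = """airline,delayed
A,yes
A,no
A,no
B,yes
C,
"""


def summarize_column(rows, column):
    """Return (share of each non-empty value, number of missing values) for one column."""
    values = [row.get(column) for row in rows]
    present = [v for v in values if v is not None and v.strip() != ""]
    missing = len(values) - len(present)
    counts = Counter(present)
    total = len(present)
    shares = {value: count / total for value, count in counts.items()} if total else {}
    return shares, missing


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for column in ("airline", "delayed"):
    shares, missing = summarize_column(rows, column)
    print(column, shares, "missing:", missing)
```

If one value dominates a column, or many values are missing, that is a hint that the dataset may not represent the problem you are trying to solve.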
