# Open datasets

# Acquiring Data from open repositories

A crucial part of a computational biologist's work is not only analysing data but also acquiring it: real datasets to analyse, as well as toy datasets for testing computational methods and algorithms. The internet is full of such open datasets. Some of them, especially medical data, require you to sign up and create a user account before you get access. This can be time-consuming, so here we focus on easily accessible resources, mostly of modest size. Several Python libraries provide some kind of `dataset` module that makes fetching online data nearly seamless, with little need for preprocessing.


### Goal of the notebook

Here you will get familiar with some ways to fetch datasets from online sources. We then explore the data with some basic plots to see how it is shaped. In the previous notebook you learned how to use seaborn, and we will use it here too.

## Import libraries

In [None]:
# import basic libraries
import numpy as np
import pandas as pd
import seaborn as sns

# Fetching online data using modules

We start with **scikit-learn's** dataset module. Scikit-learn is a machine learning library that we will use in Part 3 of this course. See the [scikit-learn API reference](https://scikit-learn.org/stable/modules/classes.html?highlight=datasets#module-sklearn.datasets) for an overview of the datasets it provides.

In [None]:
from sklearn.datasets import load_iris

Remember that in the last notebook we loaded the iris dataset directly from a URI using `pd.read_csv()`. We could also have used seaborn's `load_dataset` function: `sns.load_dataset("iris")`. Here we load the iris dataset from scikit-learn's datasets module:

In [None]:
# load data as object
data = load_iris()

# You can also set the parameter 'return_X_y' to True, which returns X (data) and y (class) separately:
# X, y = load_iris(return_X_y=True)

In [None]:
# data is a dictionary-like object (a scikit-learn Bunch) that holds all kinds of information
data

You can access its attributes the same way you access columns in a DataFrame:

In [None]:
print(data.DESCR)

# or
# data['DESCR']
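
The object returned by `load_iris()` is a scikit-learn `Bunch`, a dictionary-like container. As a minimal sketch of what that means in practice, the cell below lists its keys and checks that attribute access and key access give the same result:

```python
from sklearn.datasets import load_iris

data = load_iris()

# A Bunch behaves like a dictionary, so its keys can be listed
print(list(data.keys()))

# Attribute access and key access return the same object
print(data.feature_names is data["feature_names"])

# The measurements and the class labels are NumPy arrays
print(data.data.shape)    # (150, 4): 150 flowers, 4 measurements each
print(data.target.shape)  # (150,): one class label per flower
```

Listing the keys first is a quick way to find out what a freshly loaded dataset offers before reading the full description.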

Let's separate the data and create a DataFrame:

In [None]:
df = pd.DataFrame(data=data.data, columns=data.feature_names)
df.head()

<div class='alert alert-warning'>
    <h4> Exercise 1.</h4>  The DataFrame is still missing the classes. Make a new column <b>species</b> that has the values of <b>'target'</b>. How do you get the target_names instead of just numbers?
    </div>

In [None]:
# Ex1


In [None]:
# %load solutions/ex2_1.py

In [None]:
df.head()

`describe()` is a great function for getting a better overview of your data:

In [None]:
df.describe()

And just like in the previous notebook you can visualize your data using seaborn:

In [None]:
sns.barplot(x='species', y='sepal length (cm)', data=df)

To see if features correlate with each other you can use `corr()`:

In [None]:
# 'species' may hold text, so restrict the correlation to numeric columns
print(df.corr(numeric_only=True))
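
To make the output of `corr()` easier to read, here is a small sketch on invented toy data (the column names and values are hypothetical and not part of the course data). It assumes pandas 1.5 or newer, where `numeric_only` is available, and shows a perfect positive and a perfect negative correlation:

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    "height": [1.0, 2.0, 3.0, 4.0],
    "weight": [2.0, 4.0, 6.0, 8.0],  # exactly 2 * height
    "score": [4.0, 3.0, 2.0, 1.0],   # decreases as height increases
    "label": ["a", "b", "a", "b"],   # non-numeric, skipped below
})

# Pairwise Pearson correlation between the numeric columns
corr = toy.corr(numeric_only=True)
print(corr.round(2))

# Correlations of every other numeric column with 'height', sorted
print(corr["height"].drop("height").sort_values())
```

Values close to 1 or -1 indicate a strong linear relationship, while values near 0 indicate little or no linear relationship.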

# Import local data

If you have downloaded data to your computer (in CSV format), it's easy to access it using e.g. the pandas `read_csv` function. Here we have data on the impact of Covid on airport traffic:

In [None]:
df = pd.read_csv('data/covid_impact_on_airport_traffic.csv')

In [None]:
df.head()

Here you might want to get rid of a few columns we don't need. Let's keep `Date`, `City` and `PercentOfBaseline`, which represents the proportion of trips on this date compared to the average number of trips on the same day of the week in the baseline period.

In [None]:
# keep only these columns
df = df[['Date', 'PercentOfBaseline', 'City']]
df.head()
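
As an alternative to dropping columns after loading, `read_csv` can also read only the columns you ask for through its `usecols` parameter. The sketch below uses a small in-memory CSV as a stand-in for the real file; its rows and the extra `Country` column are invented for illustration:

```python
import io
import pandas as pd

# Hypothetical stand-in for the CSV file; the values and the
# 'Country' column are invented for this example
csv_text = """Date,PercentOfBaseline,City,Country
2020-04-03,64,Sydney,Australia
2020-04-13,29,Sydney,Australia
2020-07-10,50,Boston,United States
"""

wanted = ["Date", "PercentOfBaseline", "City"]

# Only the listed columns are parsed; everything else is skipped
small = pd.read_csv(io.StringIO(csv_text), usecols=wanted)
print(small.columns.tolist())
print(small.shape)
```

For large files this can save memory, because the unwanted columns are never loaded.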

Let's take a quick look into Sydney airport traffic:

In [None]:
sydney = df.loc[df.City == 'Sydney'].copy()  # subset of only Sydney traffic (a copy, so it can be modified safely)
sydney.describe()

In [None]:
sydney.plot(x='Date', y='PercentOfBaseline')

<div class='alert alert-warning'>
    <h4> Exercise 2.</h4>  Does the plot make sense? What should have been done before plotting?
    </div>



In [None]:
# Ex2

In [None]:
# %load solutions/ex2_2.py

In [None]:
# try plotting again
sydney.plot(x='Date', y='PercentOfBaseline')

# Import your own data

Now choose one of the following datasets from Kaggle, download it, move it into the data folder and then do some data exploration. Use examples from the previous notebook to make plots.

[Heart disease dataset](https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction)

[Breast cancer dataset](https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data)

[Diabetes dataset](https://www.kaggle.com/datasets/mathchi/diabetes-data-set)

**Load csv**

**What do the rows and columns represent here?**

**How many samples does the data have? What about features?**

**Choose a few features and plot histograms of them. What do they tell you? Describe the plots.**

All three listed datasets have something called a target variable. It is the feature of the dataset that you want to understand better and that the other features might explain. Usually it is some kind of class. For example, in the iris dataset the target would be the species, and the explanatory variables are the lengths and widths of the sepals and petals.
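
The distinction between the target and the explanatory features can be sketched in code. The toy DataFrame below is hypothetical (its column names and values are invented); it only illustrates how one might separate the two and inspect the classes:

```python
import pandas as pd

# Hypothetical toy data; 'disease' plays the role of the target variable
toy = pd.DataFrame({
    "age": [63, 45, 58, 39],
    "cholesterol": [233, 180, 250, 170],
    "disease": [1, 0, 1, 0],
})

y = toy["disease"]               # target: the class we want to explain
X = toy.drop(columns="disease")  # explanatory features

print(X.shape, y.shape)

# How many samples fall into each class?
print(y.value_counts())
```

Checking the class counts early shows whether the classes are balanced, which matters when you interpret plots and statistics later.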

**What is the target variable in your data? What are the different classes (e.g. if the values are 0s and 1s, what do they mean?)**

In [None]:
# 

**Calculate correlation between all numeric features using DataFrame function `corr()`. Then use it to make a heatmap.**

<div class='alert alert-info'>
    <b>Hint!</b> 
    
You can change categorical (non-numeric) features to dummy variables (1/0 i.e. True/False) using the
    `pd.get_dummies()`
    function. E.g. if you have a DataFrame <b>df</b> with a column <b>Gender</b> that has the values M and F, you can run pd.get_dummies(df, columns=['Gender']) and you get two new columns, Gender_F and Gender_M, that both consist of indicator values (1/0 or True/False, depending on your pandas version).
    </div>
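
A minimal sketch of the hint above, using an invented toy DataFrame (the `Age` values are hypothetical). Passing `dtype=int` is one way to get 1/0 values regardless of the pandas version:

```python
import pandas as pd

# Hypothetical toy data with one categorical column
toy = pd.DataFrame({"Age": [34, 51, 29], "Gender": ["M", "F", "M"]})

# Replace 'Gender' with one indicator column per category
encoded = pd.get_dummies(toy, columns=["Gender"], dtype=int)
print(encoded)
```

The original `Gender` column is removed and the numeric `Age` column is left unchanged, so the result can be passed directly to `corr()`.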

**Choose features that correlate (positively or negatively) with your target and make a pairplot using your target as hue. Can you make any interesting observations?**

**Now make at least three other plots of your choice to see what the data is like. You can also calculate statistics and use the information from the heatmap to get ideas what could be interesting to plot. Feel free to experiment and write down any observations you can make from the plots.**


---

# Useful resources and links

Kaggle is a great place for datasets, but there are many other good open sources as well. Here is a list of useful links to public datasets well suited for data analysis and machine learning:


### Links
- [OpenML](https://www.openml.org/search?type=data)
   - large variety of datasets for machine learning
- [Nilearn datasets](https://nilearn.github.io/modules/reference.html#module-nilearn.datasets)
   - provides a collection of neuroimaging datasets
- [Sklearn datasets](https://scikit-learn.org/stable/modules/classes.html?highlight=datasets#module-sklearn.datasets)
- [Kaggle](https://www.kaggle.com/datasets)
   - Competition website with a wide variety of public datasets and user notebooks
- [MEDMNIST](https://medmnist.com/)
   - Biomedical Images for classification


- [**Awesomedata**](https://github.com/awesomedata/awesome-public-datasets)
   - We strongly recommend checking out the Awesomedata lists of public datasets, covering topics such as [biology/medicine](https://github.com/awesomedata/awesome-public-datasets#biology) and [neuroscience](https://github.com/awesomedata/awesome-public-datasets#neuroscience)

- [Papers with code](https://paperswithcode.com)
   - free and open resource with Machine Learning papers, code, datasets, methods and evaluation tables

- [SNAP](https://snap.stanford.edu/data/)
  - Stanford Large Network Dataset Collection  
- [Open Graph Benchmark (OGB)](https://github.com/snap-stanford/ogb)
  - Network datasets
- [Open Neuro](https://openneuro.org/)
   - platform for validating and sharing neuroimaging data