## Loading Data


### Data frame and basic operations

In Python, Pandas is a commonly used library for reading a file into a data frame. I downloaded Auto.csv from the book website. First, take a look at the CSV file: it has a header row, missing values are marked by `?`, and the fields are separated by commas. We can use the `read_csv` function to read the CSV file into a data frame. The `read_csv` function has many parameters; in IPython or Jupyter, `pd.read_csv?` shows the documentation of the function.

The following code shows how to read the CSV file "Auto.csv" from the textbook into a data frame `auto_df`.

In [None]:
import pandas as pd

data_url = "https://github.com/pykale/transparentML/raw/main/data/Auto.csv"
# The first row holds the column names; "?" marks a missing value
auto_df = pd.read_csv(data_url, header=0, na_values="?")
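
To see what `na_values="?"` changes, the sketch below reads a tiny in-memory CSV twice, with and without that argument. The three rows and the names `csv_text`, `raw_df` and `sample_df` are made up for illustration and are not taken from Auto.csv.

```python
import io

import pandas as pd

# Made-up sample in a CSV layout with a header row (not the real Auto.csv data)
csv_text = """mpg,cylinders,horsepower,name
18.0,8,130,car a
25.0,4,?,car b
21.0,6,95,car c
"""

# Without na_values, "?" stays a string, so the whole column is read as text
raw_df = pd.read_csv(io.StringIO(csv_text), header=0)

# With na_values="?", the marker becomes NaN and the column is numeric
sample_df = pd.read_csv(io.StringIO(csv_text), header=0, na_values="?")

print(raw_df["horsepower"].dtype)
print(sample_df["horsepower"].dtype)
print(sample_df["horsepower"].isnull().sum())
```

Reading the missing-value marker correctly at load time matters because a text column cannot be summarised or plotted as a number later on.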

The `.head()` method can be used to get the first 5 (by default) rows of the data frame.

In [None]:
auto_df.head()

The `.describe()` method returns summary statistics of the data frame. For a data frame with mixed column types, only the numerical columns are summarised by default. Use the argument `include` to choose which columns are summarised, e.g. `include="all"` for mixed types, `include=[np.number]` for numerical columns (with `numpy` imported as `np`), and `include=["O"]` for object columns.

In [None]:
auto_df.describe()

In [None]:
auto_df.describe(include="all")
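
As a minimal sketch of how `include` changes the result, the example below summarises a small made-up data frame (`toy_df` and its values are illustrative assumptions, not part of the Auto data).

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only
toy_df = pd.DataFrame(
    {
        "mpg": [18.0, 15.0, 26.0, 31.0],
        "cylinders": [8, 8, 4, 4],
        "name": ["car a", "car b", "car c", "car d"],
    }
)

# Only the numerical columns: count, mean, std, min, quartiles, max
numeric_summary = toy_df.describe(include=[np.number])

# All columns: text columns add the rows unique, top and freq
full_summary = toy_df.describe(include="all")

print(numeric_summary)
print(full_summary)
```

In the `include="all"` output, statistics that do not apply to a column (for example the mean of `name`) are shown as NaN.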

The dimensions of a data frame are given by the `.shape` attribute, the same as for `NumPy` arrays.

In [None]:
auto_df.shape

Indexing in a Pandas data frame is similar to indexing in `NumPy` arrays. A row, a column, or a submatrix can be accessed with the `.iloc[]` or `.loc[]` indexer. `iloc` indexes by integer position, and `loc` indexes by labels (row and column names).

In [None]:
auto_df.iloc[:4, :2]

In [None]:
auto_df.loc[[0, 1, 2, 3], ["mpg", "cylinders"]]

An alternative way to select the first 4 rows is to slice the data frame directly.

In [None]:
auto_df[:4]
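
The difference between `iloc` and `loc` is easier to see when the row labels are not 0, 1, 2, and so on. The sketch below uses a made-up data frame (`toy_df`, with arbitrary row labels 10 to 40) to select the same two rows both ways.

```python
import pandas as pd

# Made-up values; the row labels deliberately differ from the row positions
toy_df = pd.DataFrame(
    {"mpg": [18.0, 15.0, 16.0, 17.0], "cylinders": [8, 8, 6, 4]},
    index=[10, 20, 30, 40],
)

# Position-based: rows 0 and 1, column 0 (the end of the slice is excluded)
by_position = toy_df.iloc[:2, :1]

# Label-based: row labels 10 through 20, column "mpg" (the end label is included)
by_label = toy_df.loc[10:20, ["mpg"]]

print(by_position)
print(by_label)
```

Both selections return the same two rows here. Note that `toy_df.loc[0]` would raise a `KeyError`, because no row carries the label 0.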

The column names can be obtained with the `list` function or the `.columns` attribute.

In [None]:
print(list(auto_df))
print(auto_df.columns)

The `.isnull()` and `.sum()` methods can be combined to count the NaNs in each variable.

In [None]:
auto_df.isnull().sum()

In [None]:
# The data has 397 observations and only 5 of them have missing values,
# so we can simply drop the rows with missing values.
print(auto_df.shape)
auto_df = auto_df.dropna()
print(auto_df.shape)
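
The same counting and dropping steps can be checked on a small made-up data frame, where the result is easy to verify by eye. The name `toy_df` and its values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Made-up values with two missing entries in one column
toy_df = pd.DataFrame(
    {
        "mpg": [18.0, 15.0, 26.0, 31.0],
        "horsepower": [130.0, np.nan, 90.0, np.nan],
    }
)

# isnull() gives a True/False table; sum() counts the True values per column
missing_per_column = toy_df.isnull().sum()
print(missing_per_column)

# dropna() returns a new data frame without the rows that contain any NaN
clean_df = toy_df.dropna()
print(toy_df.shape, clean_df.shape)

# The surviving rows keep their original labels, so the labels now have gaps
print(clean_df.index.tolist())
```

If consecutive row labels are needed after dropping rows, `clean_df.reset_index(drop=True)` renumbers them from 0.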

The type of a variable can be changed. The following example converts `cylinders` into a categorical variable.

In [None]:
auto_df["cylinders"] = auto_df["cylinders"].astype("category")
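
To check what the conversion did, one can inspect the column type and its categories. The sketch below does this on a made-up column (`toy_df` is an illustrative assumption, not the Auto data).

```python
import pandas as pd

# Made-up values for illustration only
toy_df = pd.DataFrame({"cylinders": [8, 8, 4, 6, 4]})
print(toy_df["cylinders"].dtype)  # an integer type before the conversion

toy_df["cylinders"] = toy_df["cylinders"].astype("category")
print(toy_df["cylinders"].dtype)  # category after the conversion

# The distinct values become the categories, sorted in increasing order
categories = toy_df["cylinders"].cat.categories.tolist()
print(categories)

# value_counts() gives the number of observations in each category
counts = toy_df["cylinders"].value_counts()
print(counts)
```

A categorical column is treated as a set of labels rather than as numbers, so `.describe()` reports count, unique, top and freq for it instead of a mean.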

### Visualising data

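A column of a data frame can be referred to by name using `.column_name`, e.g. `auto_df.mpg`. See the options of `plt.plot` for more plotting styles.

In [None]:
import matplotlib.pyplot as plt

# "ro" draws red circles: one point per car
plt.plot(auto_df.cylinders, auto_df.mpg, "ro")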
plt.show()

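The `.hist()` method draws histograms of the variables in a data frame. Use the argument `column` to select which variables to plot. Only numerical columns are drawn, so `cylinders` is skipped here if it has already been converted to a categorical variable.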

In [None]:
auto_df.hist(column=["cylinders", "mpg"])
plt.show()


### Exercises

This exercise is related to the College dataset, which can be found in the file college.csv in [Table 1.2](https://pykale.github.io/transparentML/01-intro/organisation.html#datasets-table). It contains a number of variables/features for 777 different universities and colleges in the US.

Q1. Use the read_csv() function to read the data. Call the loaded data college_df. Print the first 20 rows of the loaded data. Make sure that you have the directory set to the correct location for the data.  

Q2. Find the number of variables/features in the dataset and print them.  

Q3. Use the describe() function to produce a numerical summary of the variables/features in the data set.  

Q4. How many quantitative and qualitative variables are there in this dataset?  

Q5. Use the plot() function to produce side-by-side boxplots of any 2 variables/features chosen by you.  

Q6. Continue exploring the data, and provide a brief summary of what you discover.  
