# 5 Day Data Challenge (Python) - Day 1

This notebook follows the tutorial by [Rachael Tatman](https://www.kaggle.com/rtatman) from the [5 Day Data Challenge](https://www.kaggle.com/rtatman/the-5-day-data-challenge). You can also watch the [accompanying video](https://youtu.be/SFQ1ECXiUME?t=8m55s).

## 0. Choosing a dataset

Pick one of the datasets from [this list](https://www.kaggle.com/rtatman/fun-beginner-friendly-datasets/) of beginner-friendly datasets, then click on **New Kernel** and then on **Notebook**. This will open up a new notebook. Choose Python as the programming language in the drop-down menu at the top.

## 1. Import libraries

The first step is to import libraries. For data analysis in Python, the most commonly used ones are [NumPy](http://www.numpy.org/), [Pandas](https://pandas.pydata.org/) and [Matplotlib](https://matplotlib.org/).

Click on the cell below and click on the blue triangular 'play' button on the left-hand side to run it. Alternatively, you can press `Ctrl + Enter` to run the cell.

In [None]:
import numpy as np      # linear algebra
import pandas as pd     # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt    # data visualization

## 2. Input files

Input data files are available in the `"../input/"` directory. We can list the available files from the input directory as shown below. Alternatively, you can click on `[+] Input Files` at the top.

In [None]:
# list files from input directory
from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))
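
The `ls` call above relies on a Unix-like shell. As a hedged alternative, the same listing can be sketched with the standard-library function `os.listdir()`, which works on any operating system. The helper name `list_input_files` and the temporary demo directory are assumptions made for illustration and are not part of the original tutorial.

```python
import os
import tempfile


def list_input_files(directory):
    """Return the sorted names of the entries in `directory`."""
    return sorted(os.listdir(directory))


# Demonstration on a temporary directory so the sketch runs anywhere;
# on Kaggle you would call list_input_files("../input") instead.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "archive.csv"), "w").close()  # hypothetical file
    demo_files = list_input_files(tmp)

print(demo_files)
```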

## 3. Markdown and Code cells

You can add cells to the notebook by clicking on an existing cell and clicking on the blue 'plus' symbol next to the red trash symbol. You can also choose between two types of cells.

**1. Markdown**
> Markdown cells allow you to type regular text with formatting options using Markdown.

**2. Code**
> Code cells are meant for your source code either in Python or R depending on which programming language you have chosen for your notebook.

## 4. Read in data

In Python we can use the function `pd.read_csv()` from the Pandas library to read in CSV files (CSV stands for 'comma-separated values') and store the content in a so-called dataframe, here named `df`. You can think of a dataframe as a table similar to one in Excel.

We pass the file path as a string to the function `pd.read_csv()`:

`df = pd.read_csv('../input/archive.csv')`

**Note:** 

When typing the file path you can also use `TAB` to show the options and autocomplete the path, e.g. typing

`df = pd.read_csv('../input/')`

and then using `TAB` autocompletes the string for the path to

`df = pd.read_csv('../input/archive.csv')`
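
To try `pd.read_csv()` without depending on a particular dataset, here is a minimal sketch that reads CSV text from memory. The column names and values are made up for illustration; on Kaggle you would pass the real file path instead of the `io.StringIO` object.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file such as '../input/archive.csv'
csv_text = "name,year,rating\nAlpha,2001,7.5\nBeta,2005,8.1\nGamma,2010,\n"

# pd.read_csv accepts a file path or any file-like object
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df)   # the empty rating in the last row is read as NaN
```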

## 5. Examine the dataframe

### 5.1 Print the first 20 rows

Let's print out the first 20 rows of our dataframe by using the method `df.head(20)`. Called without an argument, `df.head()` returns only the first 5 rows.

In [None]:
df = pd.read_csv('../input/archive.csv')    # read csv file

df.head(20)    # Alternatively use print(df.head(20)), though 
               # on Kaggle you don't need to explicitly use print()
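
The effect of the row-count argument can be checked on a small toy dataframe. This is only a sketch: `numbers_df` is a hypothetical stand-in for the real data, and `tail()` is shown as the counterpart that returns rows from the end.

```python
import pandas as pd

# Hypothetical dataframe with 30 rows
numbers_df = pd.DataFrame({"n": range(30)})

default_rows = numbers_df.head()     # no argument: first 5 rows
twenty_rows = numbers_df.head(20)    # first 20 rows
last_rows = numbers_df.tail(3)       # last 3 rows

print(len(default_rows), len(twenty_rows), len(last_rows))
```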

### 5.2 Print the number of rows and columns

We can also examine how many rows and columns our dataframe has by accessing the attribute `df.shape`.

In [None]:
print(df.shape)    # print number of rows and columns

print("The number of rows is: ", df.shape[0])
print("The number of columns is: ", df.shape[1])
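
Because `df.shape` is a plain tuple of the form `(rows, columns)`, it can also be unpacked into two variables. A short sketch on a hypothetical dataframe `shape_df`:

```python
import pandas as pd

# Hypothetical dataframe with 3 rows and 2 columns
shape_df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Unpack the (rows, columns) tuple in one step
n_rows, n_cols = shape_df.shape

print(f"{n_rows} rows, {n_cols} columns")
```

Note that `len(shape_df)` gives the number of rows as well.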

### 5.3 Print the column names

We can print the column names of our dataframe with the attribute `df.columns`. For more information on Pandas DataFrames see:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html

In [None]:
print(df.columns)
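
`df.columns` returns a Pandas `Index` object rather than a plain list. If a list is more convenient, or you want to check whether a column exists, something like the following sketch may help (`cols_df` and its column names are made up for illustration):

```python
import pandas as pd

# Hypothetical dataframe with two columns
cols_df = pd.DataFrame({"name": ["Alpha"], "year": [2001]})

column_names = cols_df.columns.tolist()   # convert the Index to a Python list
has_year = "year" in cols_df.columns      # membership test on the Index

print(column_names)
print(has_year)
```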

### 5.4 Print statistical information

We can print summary statistics for our dataframe with the method `df.describe()`. By default it covers the numeric columns and excludes NaN (not a number) values; see also the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html).

In [None]:
df.describe()    # print statistical information
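
To see how missing values and non-numeric columns are treated, consider this small sketch. The dataframe `stats_df` is hypothetical: it has one numeric column containing a NaN and one text column.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a numeric column with one missing value and a text column
stats_df = pd.DataFrame({
    "score": [1.0, 2.0, np.nan, 4.0],
    "label": ["a", "b", "c", "d"],
})

summary = stats_df.describe()                    # numeric columns only
full_summary = stats_df.describe(include="all")  # also summarizes text columns

print(summary)
print(full_summary)
```

In `summary`, the `count` row for `score` counts only the three non-missing values, and the `label` column is left out unless `include="all"` is passed.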

### 5.5 Summary of dataframe

We can use `df.info()` to get a summary of the dataframe, including the data type and the number of non-null values in each column, see also

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.info.html

In [None]:
df.info()
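
`df.info()` prints its summary directly instead of returning it. If you want the text as a string, one option is to pass a buffer via the `buf` parameter, as in this sketch. The dataframe `info_df` is hypothetical, and `isna().sum()` is included as a complementary way to count missing values per column.

```python
import io

import pandas as pd

# Hypothetical dataframe with one missing rating
info_df = pd.DataFrame({"name": ["Alpha", "Beta"], "rating": [7.5, None]})

# Write the summary into an in-memory buffer instead of standard output
buffer = io.StringIO()
info_df.info(buf=buffer)
info_text = buffer.getvalue()
print(info_text)

# Count the missing values in each column
missing_per_column = info_df.isna().sum()
print(missing_per_column)
```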

## 6. Save the notebook

You can save the notebook on Kaggle by clicking on the "Fast Forward" symbol (two black triangles) at the top which executes all cells and saves the notebook.

You can also download the notebook as an `.ipynb` (IPython notebook) file, which you can open in [Jupyter](http://jupyter.org/).

## 7. Publish your notebook

If you would like to share your notebook, then you can click on the **Publish** button at the top right corner. You can also change the status of your notebook afterwards to hide or delete it.