# Lecture 02: Basic Jupyter Notebooks 

<font size="4"> 

The basic structure for running Python for data projects
<img src="figures/project_flow.png" alt="drawing" width="800"/>
- Python is a general purpose language
- Researchers and practitioners add new functionalities all the time
- New features are included as libraries on top of the "basic" installation

***

## Preliminaries 

<font size="4"> 

- A Virtual Environment is a **directory** (folder in your computer) <br>
that contains a specific **collection of packages**  <br>
This is the *.venv* folder that appears when you create or open a PyCharm project.

- A package is a folder containing a set of Python scripts or <br>
modules which allow you to accomplish a defined task <br> 
(visualisation, analysis, mathematical operations, etc.)
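
One way to see which environment a notebook is running in is to ask the interpreter where it lives. The sketch below uses only the standard library's `sys` module; the variable names `env_path` and `in_virtual_env` are illustrative and not part of the lecture material.

```python
import sys

# sys.prefix is the folder of the environment the running interpreter belongs to.
# Inside a virtual environment it differs from sys.base_prefix.
env_path = sys.prefix
in_virtual_env = sys.prefix != sys.base_prefix

print("Environment folder:", env_path)
print("Running inside a virtual environment:", in_virtual_env)
```

If the printed folder ends in `.venv`, the notebook is most likely using the project's virtual environment.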

## Setup Working Environment

<font size="4"> 

![](figures/python_kernel.png)

- If you see Python 3 (ipykernel), you are good to go. <br>
- Otherwise, go to Kernel -> Change Kernel -> Select Python 3 (ipykernel).


<font size = "4">

(a) Import Packages:

- Jupyter notebooks launch with only a basic set of options
- Install packages using the PyCharm Python Interpreter settings covered in the last lecture

 For recent versions of PyCharm, there is a shortcut to install packages:
 <img src="figures/pycharm_shortcut_package.png" alt="drawing" width="800"/>

>  Click "Python packages" icon ->
>  Search for the package ->
>  Specify the version number ->
>  Click "install". 

Please wait until the progress bar on the bottom right is clear. You can refresh the Jupyter notebook if you do not see the changes immediately.
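
If you are unsure whether an installation finished, a quick check from inside the notebook can help. This is a minimal sketch using the standard library's `importlib.util.find_spec`; the helper name `is_installed` is hypothetical and only used here for illustration.

```python
import importlib.util

def is_installed(package_name):
    """Return True if Python can locate the named top-level package."""
    return importlib.util.find_spec(package_name) is not None

# Check the packages used in this lecture
for name in ["pandas", "matplotlib", "scipy"]:
    print(name, "installed:", is_installed(name))
```

A result of `False` suggests the package was installed into a different environment, or that the installation has not finished yet.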

- The "import" command adds libraries to the working environment.
- We can give the libraries a nickname with "as"



`matplotlib` lets us create graphs in Python <br>
`pandas` lets us work with datasets

### Practice:
<font size = "4">
    
- <span style="color: red;">Try installing a new package *scipy* using the shortcut approach.</span>
    

In [1]:
# Notes about nicknames:
# - For example, "matplotlib.pyplot" is a long name. Let's call it "plt"
# - Similarly, let's give "pandas" the nickname "pd"
# - Try adding your own nickname!
# - To avoid errors, be consistent with your nicknames

import matplotlib.pyplot as plt
import pandas as pd


### Practice
<font size = "4">
- <span style="color:red">Practice: Try importing *scipy* </span>

In [None]:
import scipy

<font size="4"> 

(b) Open datasets

Run the command "read_csv" from the library <br>
"pandas" (nicknamed "pd"). 


In [None]:
print('Hello, World!')

In [None]:
# You can use "." to run subcommands contained in a library.
# The subcommand "read_csv()" opens the file named in the parentheses.
# We use the "=" symbol to store the data in the working environment under the name "carfeatures"


carfeatures = pd.read_csv('features.csv')
carfeatures.head()
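
To try `read_csv` without needing a file on disk, the sketch below reads CSV text held in a string. The data and the names `csv_text` and `toy` are made up for illustration and are not the contents of `features.csv`.

```python
import io
import pandas as pd

# Hypothetical CSV content: a header row followed by two data rows
csv_text = "mpg,cylinders,weight\n18,8,3504\n36,4,1800\n"

# io.StringIO makes the string behave like an open file
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)          # (number of rows, number of columns)
print(list(toy.columns))  # column names taken from the header row
```

The pattern is the same as with a real file: `pd.` selects the library, `read_csv(...)` is the subcommand, and `=` stores the result under a name.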

<font size="4"> 

You can open a dataset in the current environment and display its first five rows using the *.head()* function. 

Notice that Python indexing starts from 0. We will talk about those details in the future.
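
A small illustration of `.head()` and the zero-based index, using a made-up dataset (the name `toy` and its values are assumptions, not the lecture data):

```python
import pandas as pd

# Hypothetical toy data with seven rows
toy = pd.DataFrame({"mpg": [18, 15, 36, 26, 30, 22, 24],
                    "cylinders": [8, 8, 4, 4, 4, 6, 6]})

first_rows = toy.head()     # first five rows by default
first_three = toy.head(3)   # pass a number to change how many rows are shown

print(first_rows)
print(list(first_rows.index))  # row labels start counting at 0
```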


### Practice
<font size='4'>
- <span style="color:red">Print out the first five rows using the .head() function</span>

***


## STEP 2: Run Analyses

<font size="4"> 

Output data for all the columns

In [None]:
# Entering the name of a dataframe produces an output (with the first five and last five rows)

carfeatures

<font size="4"> 

Output data for a single column 'cylinders'

In [None]:
# We use square brackets [...] to subset information from data 
# Text/strings have to be written in quotation marks
# This command extracts the column 'cylinders'

carfeatures['cylinders']
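
The same square-bracket syntax can also take a list of column names. The sketch below uses made-up data (the name `toy` is an assumption) to show the difference between selecting one column and selecting several:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"mpg": [18, 15, 36],
                    "cylinders": [8, 8, 4],
                    "weight": [3504, 3693, 1800]})

one_column = toy["cylinders"]            # a single name returns a Series
two_columns = toy[["mpg", "cylinders"]]  # a list of names returns a DataFrame

print(type(one_column).__name__)
print(type(two_columns).__name__)
```

Note the double brackets in the second case: the inner pair builds a Python list, and the outer pair does the subsetting.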


<font size="4"> 

Example: Compute a frequency table

In [None]:
# crosstab counts how many rows fall into each category
# "index" is the column whose categories are counted
# "columns" is a custom title for the column of counts

table = pd.crosstab(index = carfeatures['cylinders'], columns = "count")
table

In [None]:
table.columns.name

In [None]:
table.columns.name = 'column name'
table
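
To see what `crosstab` counts, it can help to run it on data small enough to check by hand. The values below are made up (the names `toy` and `toy_table` are assumptions), so the counts can be verified by eye:

```python
import pandas as pd

# Hypothetical toy data: three cars with 4 cylinders, two with 6, two with 8
toy = pd.DataFrame({"cylinders": [8, 8, 4, 4, 4, 6, 6]})

toy_table = pd.crosstab(index=toy["cylinders"], columns="count")
toy_table.columns.name = "my table"   # rename the label shown above the columns

print(toy_table)
```

Each row of the result is one category of `cylinders`, and the `count` column holds the number of rows in that category.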

### Practice
<font size='4'>
<span style='color:red'> Try the command again but this time change the column name to "frequency table". </span>


In [None]:
table.columns.name = 'frequency table'
table

<font size="4"> 

Example: Compute basic summary statistics for all variables

In [None]:
# "describe" computes the count, mean, std, min, 25% quantile, 50%, 75%, max
# By default it excludes columns with text values
# and includes all numeric columns

carfeatures.describe()
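
The sketch below illustrates how `describe()` treats text columns, using made-up data (the names `toy`, `summary`, and `full_summary` are assumptions). It also shows the `include="all"` option, which asks pandas to summarize text columns as well:

```python
import pandas as pd

# Hypothetical toy data with one text column and one numeric column
toy = pd.DataFrame({"name": ["a", "b", "c", "d", "e"],
                    "mpg": [18, 15, 36, 26, 30]})

summary = toy.describe()                    # numeric columns only
full_summary = toy.describe(include="all")  # numeric and text columns

print(summary)
print(list(full_summary.columns))
```

In `summary` the `name` column is missing because it holds text; in `full_summary` it appears, with the numeric statistics left empty for that column.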

<font size="5"> 

Example: Display a scatter plot 

In [None]:
plt.scatter(x = carfeatures['weight'], y = carfeatures['mpg'])
plt.show()
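
Plots are easier to read with axis labels. This is a minimal sketch on made-up data (the name `toy` and its values are assumptions) that adds labels and a title with `plt.xlabel`, `plt.ylabel`, and `plt.title`:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"weight": [3504, 3693, 1800, 2130, 2200],
                    "mpg": [18, 15, 36, 26, 30]})

plt.scatter(x=toy["weight"], y=toy["mpg"])
plt.xlabel("weight")                      # label for the horizontal axis
plt.ylabel("mpg")                         # label for the vertical axis
plt.title("mpg against weight (toy data)")
ax = plt.gca()                            # keep a handle to the current axes
plt.show()
```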

### Practice
<font size='4'>
- <span style='color:red'>Try another scatter plot with x = "acceleration" </span>



In [None]:
plt.scatter(x = carfeatures['acceleration'], y = carfeatures['mpg'])
plt.show()