# Tutorial 1: Setup and Introduction to Python and Pandas
---

Jiawei Li

# Introduction

Some background about me:

-   Bachelor’s degree in Economics, didn’t believe in it, moved to data
    science.
-   Generalist or jack of all trades for all sorts of things related to
    data science, e.g. programming, machine learning, data engineering,
    data communication.
-   Too academic for industry and too industrial for academia.
-   Research interests: AI in Economics.
-   Professional experience: mostly in finance, from stock exchanges to
    high-frequency market making, from DAX corporations to startups.

These tutorials aim to give you the principles and intuition you need to keep learning on your own.

# Setup

## Package Managers

A package manager is a software tool that automates the process of
installing, upgrading, and removing computer programs (incl. software,
applications, packages). 

I recommend using
[Winget](https://docs.microsoft.com/en-us/windows/package-manager/winget/)
for Windows and [Homebrew](https://brew.sh/) for macOS. For Linux and
Windows Subsystem for Linux, the choice of package manager usually
depends on which distribution you are using.

## Git

Git is version control software. Forget about files like
`presentation_version_final.pptx` and
`presentation_version_final_final.pptx` in your shared folders, where
nobody can figure out which one to use. Git gives you the superpower to
track code changes and sync your work with your teammates. We start from
the very basics: file system navigation and the `clone` command.

## Virtual Environments

A virtual environment is an isolated Python setup with its own set of
packages. When you install a package into a virtual environment, it is
installed only in that environment. When you then run a Python program
within that environment, you know that it’s running against only those
specific packages.
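To see which environment a notebook or script is actually using, you can ask Python itself. The snippet below is a minimal sketch using only the standard library. The variable names are chosen here for illustration, and the `CONDA_DEFAULT_ENV` check assumes the environment was activated through conda or mamba; otherwise it simply returns `None`.

```python
import os
import sys

# Path of the Python interpreter that is running this code.
interpreter_path = sys.executable
print(interpreter_path)

# Root directory of the active environment; its packages live below it.
env_root = sys.prefix
print(env_root)

# conda/mamba set this variable on activation (assumption: the environment
# was activated that way). It is None when no such environment is active.
conda_env_name = os.environ.get("CONDA_DEFAULT_ENV")
print(conda_env_name)
```

If the printed paths point somewhere other than the environment you expected, the program is not running against the packages you installed there.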

## Mambaforge

Mambaforge is
basically [miniconda](https://docs.conda.io/en/latest/miniconda.html)
with the following features pre-configured:

-   [`conda-forge`](https://conda-forge.org/) set as the default (and
    only) channel to provide more updated and comprehensive coverage of
    packages.
-   [`Mamba`](https://github.com/mamba-org/mamba) in place of `conda` to
    provide better dependency solving and faster package installation.
    I use `mamba` instead of `conda` throughout this tutorial, but the
    commands are the same for both. You can refer to conda’s [cheat
    sheet](https://docs.conda.io/projects/conda/en/latest/_downloads/843d9e0198f2a193a3484886fa28163c/conda-cheatsheet.pdf)
    and just replace `conda` with `mamba`.

In [2]:
%load_ext watermark
%watermark --machine --python --packages numpy,pandas,sklearn

Python implementation: CPython
Python version       : 3.10.6
IPython version      : 8.5.0

numpy  : 1.23.3
pandas : 1.4.4
sklearn: 1.1.2

Compiler    : Clang 13.0.1 
OS          : Darwin
Release     : 21.6.0
Machine     : arm64
Processor   : arm
CPU cores   : 10
Architecture: 64bit




# Nouns and Verbs in Python

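In an oversimplified fashion, everything you write in Python can be
understood as a noun or a verb. A noun usually refers to some object
that holds data, such as `my_bbq` below; the classes that create such
objects are conventionally written in CamelCase, such as `BbqGrill`. A
verb usually refers to some function that does something, named with
lower case letters and underscores, such as `say_hello()`. Lots of these
nouns and verbs are created by other people, so you have to import them: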

In [3]:
import someones_sick_project

In [4]:
someones_sick_project.say_hello()

Hello World!


In [5]:
my_bbq = someones_sick_project.BbqGrill(brand="Weber", model="E-330")
my_bbq

<someones_sick_project.BbqGrill at 0x145f9fca0>

In [6]:
import someones_sick_project as ssp

In [7]:
ssp.say_hello()

Hello World!


In [8]:
my_bbq = ssp.BbqGrill(brand="Weber", model="E-330")
my_bbq

<someones_sick_project.BbqGrill at 0x145fe46a0>

In [9]:
my_bbq.brand

'Weber'

In [10]:
my_bbq.model

'E-330'

In [11]:
my_bbq.grill()

🔥🔥🔥 Weber E-330 is on fire! 🔥🔥🔥
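`someones_sick_project` appears to be a made-up module for illustration, so the cells above will not run as they are. Below is a minimal sketch of what such a module could contain. The bodies of `BbqGrill`, `grill()` and `say_hello()` are assumptions chosen to reproduce the outputs shown above, not real library code.

```python
class BbqGrill:
    """A noun: a class (CamelCase) that describes a kind of thing."""

    def __init__(self, brand, model):
        # Attributes hold the data that belongs to one particular grill.
        self.brand = brand
        self.model = model

    def grill(self):
        """A verb attached to the noun: a method."""
        print(f"🔥🔥🔥 {self.brand} {self.model} is on fire! 🔥🔥🔥")


def say_hello():
    """A standalone verb: a function (lower case with underscores)."""
    print("Hello World!")


# Create one concrete noun (an object) from the class, then use the verbs.
my_bbq = BbqGrill(brand="Weber", model="E-330")
say_hello()
my_bbq.grill()
```

The dot always reads as "belonging to": `my_bbq.brand` is data belonging to the grill, and `my_bbq.grill()` is an action belonging to it.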


# Pandas

In [12]:
import pandas as pd

In [13]:
a_series = pd.Series([1, 2, 3])
a_series

0    1
1    2
2    3
dtype: int64

In [14]:
a_dataframe = pd.DataFrame({"col_1": [1, 2, 3], "col_2": [4.1, 5.4, 6.1]})
a_dataframe

|   | col_1 | col_2 |
|---|-------|-------|
| 0 | 1     | 4.1   |
| 1 | 2     | 5.4   |
| 2 | 3     | 6.1   |


In [15]:
a_series.dtypes

dtype('int64')

In [16]:
a_series.index

RangeIndex(start=0, stop=3, step=1)

In [17]:
a_dataframe.dtypes

col_1      int64
col_2    float64
dtype: object

In [18]:
a_dataframe.index

RangeIndex(start=0, stop=3, step=1)
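The cells above inspect a Series and a DataFrame separately. As a small sketch of how the two relate (the name `col_1` is introduced here for illustration only), selecting a single column from a DataFrame gives back a Series that shares the DataFrame's index:

```python
import pandas as pd

a_dataframe = pd.DataFrame({"col_1": [1, 2, 3], "col_2": [4.1, 5.4, 6.1]})

# Selecting one column with square brackets returns a Series.
col_1 = a_dataframe["col_1"]
print(type(col_1))

# The column keeps the index (row labels) of its parent DataFrame.
print(col_1.index.equals(a_dataframe.index))

# Each column carries its own dtype, matching what a_dataframe.dtypes reports.
print(col_1.dtype, a_dataframe["col_2"].dtype)
```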

In [19]:
a_series.info()

<class 'pandas.core.series.Series'>
RangeIndex: 3 entries, 0 to 2
Series name: None
Non-Null Count  Dtype
--------------  -----
3 non-null      int64
dtypes: int64(1)
memory usage: 152.0 bytes


In [20]:
a_series.describe()

count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
dtype: float64

In [21]:
a_dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   col_1   3 non-null      int64  
 1   col_2   3 non-null      float64
dtypes: float64(1), int64(1)
memory usage: 176.0 bytes


In [22]:
a_dataframe.describe()

|       | col_1 | col_2    |
|-------|-------|----------|
| count | 3.0   | 3.0      |
| mean  | 2.0   | 5.2      |
| std   | 1.0   | 1.014889 |
| min   | 1.0   | 4.1      |
| 25%   | 1.5   | 4.75     |
| 50%   | 2.0   | 5.4      |
| 75%   | 2.5   | 5.75     |
| max   | 3.0   | 6.1      |


In [23]:
a_dataframe.head(1)

|   | col_1 | col_2 |
|---|-------|-------|
| 0 | 1     | 4.1   |


With nouns and verbs in mind, the pandas documentation should be much
less confusing. Let’s say you are looking for
[`pandas.read_csv()`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas-read-csv).

In [24]:
countries = pd.read_csv("https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv")
countries.head(5)

|   | Country  | Region |
|---|----------|--------|
| 0 | Algeria  | AFRICA |
| 1 | Angola   | AFRICA |
| 2 | Benin    | AFRICA |
| 3 | Botswana | AFRICA |
| 4 | Burkina  | AFRICA |
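`pd.read_csv()` accepts a local path or a URL, as above, and also a file-like object. The sketch below uses that to run without a network connection. The three-row `csv_text` is a hand-written stand-in that mimics the file's layout, not the real data set, and the name `sample` is chosen here for illustration.

```python
import io

import pandas as pd

# A tiny stand-in for a CSV file, so the example runs offline.
csv_text = "Country,Region\nAlgeria,AFRICA\nAngola,AFRICA\nBenin,AFRICA\n"

# read_csv is a verb that lives directly under pandas and returns a noun:
# a DataFrame. io.StringIO wraps the text so it behaves like an open file.
sample = pd.read_csv(io.StringIO(csv_text))

print(type(sample))
print(sample.shape)
print(list(sample.columns))
```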


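Here is the documentation for a very similar function,
[`pandas.DataFrame.to_csv()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html).
Note the difference in the name: `to_csv()` is a method of a DataFrame,
so you call it on a DataFrame object such as `countries`, not on `pd`
itself.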

In [25]:
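# This doesn't work: to_csv() is a DataFrame method, not a top-level pandas function
# pd.to_csv("https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv")
# This works: call to_csv() on the DataFrame you want to save
countries.to_csv("countries.csv")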
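One detail worth knowing when saving: by default `to_csv()` also writes the index as an unnamed first column, and that column comes back as `Unnamed: 0` when the file is read again. The sketch below shows the round trip in memory. `countries_sample` is a small hand-made DataFrame used as an assumption in place of the downloaded data, and calling `to_csv()` without a path returns the CSV text instead of writing a file.

```python
import io

import pandas as pd

# Hand-made stand-in for the countries DataFrame.
countries_sample = pd.DataFrame(
    {"Country": ["Algeria", "Angola"], "Region": ["AFRICA", "AFRICA"]}
)

# Default behaviour: the index is written as an unnamed first column.
with_index = countries_sample.to_csv()
# index=False: only the data columns are written.
without_index = countries_sample.to_csv(index=False)

# Read both versions back to compare the resulting columns.
reloaded_with = pd.read_csv(io.StringIO(with_index))
reloaded_without = pd.read_csv(io.StringIO(without_index))

print(list(reloaded_with.columns))
print(list(reloaded_without.columns))
```

Passing `index=False` is usually what you want when the index is just the default row counter.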