# Basic commands for pandas
- pandas is a popular open-source Python library for data manipulation and analysis, comparable to a package in R. It provides easy-to-use data structures and functions for working efficiently with structured data.
- The official "10 minutes to pandas" guide covers nearly everything we need: https://pandas.pydata.org/docs/user_guide/10min.html
- Ask ChatGPT

To use pandas in your Python environment, you need to install it first using the following command:
- The leading ! runs the command from inside a notebook; remove the ! when installing from a terminal.

In [1]:
!pip install pandas



#### Once installed, you can import it in your Python scripts or Jupyter notebooks using:

In [2]:
import pandas as pd 

- pd is just a short alias for pandas; you can choose any name you like.
- We can use it to make a simple data frame:

In [3]:
df_sample = pd.DataFrame(
    {
        "Var_1": 1.0,
        "Var_2": pd.Categorical(["test", "train", "test", "train"]),
        "this_is_my_name": "foo",
    }
)
print(df_sample)

   Var_1  Var_2 this_is_my_name
0    1.0   test             foo
1    1.0  train             foo
2    1.0   test             foo
3    1.0  train             foo


#### We need to check the data types before doing any numerical operation
- See the Python data types here: https://www.w3schools.com/python/python_datatypes.asp

In [4]:
df_sample.dtypes

Var_1               float64
Var_2              category
this_is_my_name      object
dtype: object

In [5]:
df_sample['Var_1'] + 3

0    4.0
1    4.0
2    4.0
3    4.0
Name: Var_1, dtype: float64
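The reason for checking dtypes first can be sketched with a small, made-up example: numbers that arrive as text are not stored as a numeric type, so they need converting before arithmetic. The names df_text, score and score_num are hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical data: numbers that were read in as text
df_text = pd.DataFrame({"score": ["1", "2", "3"]})
print(df_text.dtypes)  # score is stored as text, not as a numeric type

# Convert the text column to numbers before doing arithmetic
df_text["score_num"] = pd.to_numeric(df_text["score"])
result = df_text["score_num"] + 3
print(result)
```

After the conversion, adding 3 works the same way as it does for Var_1 above.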

#### Check data dimension

In [6]:
# the dimension of the data set
df_sample.shape

(4, 3)
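Since shape is a plain (rows, columns) tuple, it can be unpacked into two variables. A minimal sketch, using a made-up data frame named df_small:

```python
import pandas as pd

# Hypothetical data frame with 3 rows and 2 columns
df_small = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

n_rows, n_cols = df_small.shape  # shape is a (rows, columns) tuple
print(f"{n_rows} rows, {n_cols} columns")
print(len(df_small))  # len() gives the number of rows only
```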

#### For illustration purposes, I will use a publicly available dataset from an online Monash Python class

In [7]:
df = pd.read_csv("https://monashdatafluency.github.io/python-workshop-base/modules/data/surveys.csv")
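pd.read_csv also accepts a local file path or any file-like object, which may help when the URL is not reachable. The sketch below reads the same column layout from an in-memory string; the three data rows are copied from the output shown in this section, and the names csv_text and df_local are only illustrative.

```python
import io
import pandas as pd

# A few rows in the same layout as surveys.csv, kept in memory as text
csv_text = """record_id,month,day,year,site_id,species_id,sex,hindfoot_length,weight
1,7,16,1977,2,NL,M,32.0,
2,7,16,1977,3,NL,M,33.0,
3,7,16,1977,2,DM,F,37.0,
"""

# io.StringIO makes the string behave like an open file
df_local = pd.read_csv(io.StringIO(csv_text))
print(df_local.shape)
print(df_local["weight"].isna().all())  # empty fields are read as NaN
```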

#### Viewing data

In [8]:
df.head(2)

   record_id  month  day  year  site_id species_id sex  hindfoot_length  weight
0          1      7   16  1977        2         NL   M             32.0     NaN
1          2      7   16  1977        3         NL   M             33.0     NaN


In [9]:
df.tail(2)

       record_id  month  day  year  site_id species_id  sex  hindfoot_length  weight
35547      35548     12   31  2002        7         DO    M             36.0    51.0
35548      35549     12   31  2002        5        NaN  NaN              NaN     NaN


In [10]:
df.columns

Index(['record_id', 'month', 'day', 'year', 'site_id', 'species_id', 'sex',
       'hindfoot_length', 'weight'],
      dtype='object')

#### Show a quick statistical summary of your data:

In [11]:
df.describe()

          record_id     month        day         year    site_id  hindfoot_length     weight
count       35549.0   35549.0    35549.0      35549.0    35549.0          31438.0    32283.0
mean        17775.0  6.474022  16.105966  1990.475231  11.397001        29.287932  42.672428
std    10262.256696  3.396583   8.256691     7.493355   6.799406         9.564759  36.631259
min             1.0       1.0        1.0       1977.0        1.0              2.0        4.0
25%          8888.0       4.0        9.0       1984.0        5.0             21.0       20.0
50%         17775.0       6.0       16.0       1990.0       11.0             32.0       37.0
75%         26662.0       9.0       23.0       1997.0       17.0             36.0       48.0
max         35549.0      12.0       31.0       2002.0       24.0             70.0      280.0
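By default describe() summarises only the numeric columns, which is why species_id and sex do not appear in the summary. A minimal sketch of including text columns as well, using a made-up data frame named df_mixed:

```python
import pandas as pd

# Hypothetical data: one text column and one numeric column with a missing value
df_mixed = pd.DataFrame(
    {
        "sex": ["M", "F", "M", "M"],
        "weight": [40.0, 48.0, None, 51.0],
    }
)

numeric_summary = df_mixed.describe()            # numeric columns only
full_summary = df_mixed.describe(include="all")  # adds unique/top/freq for text columns
print(full_summary)
```

Note that count excludes missing values, which matches the smaller counts for hindfoot_length and weight in the survey data.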


#### Selecting a single column, which yields a Series

In [12]:
df["weight"]

0         NaN
1         NaN
2         NaN
3         NaN
4         NaN
         ... 
35544     NaN
35545     NaN
35546    14.0
35547    51.0
35548     NaN
Name: weight, Length: 35549, dtype: float64
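Single brackets return a Series, while passing a list of column names returns a DataFrame, even for one column. A short sketch on a made-up data frame (df_small is hypothetical):

```python
import pandas as pd

df_small = pd.DataFrame({"weight": [40.0, 48.0], "sex": ["M", "F"]})

as_series = df_small["weight"]   # single brackets -> Series
as_frame = df_small[["weight"]]  # a list of names -> DataFrame
print(type(as_series).__name__, type(as_frame).__name__)
```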

#### Get some rows

In [13]:
df[0:3]

   record_id  month  day  year  site_id species_id sex  hindfoot_length  weight
0          1      7   16  1977        2         NL   M             32.0     NaN
1          2      7   16  1977        3         NL   M             33.0     NaN
2          3      7   16  1977        2         DM   F             37.0     NaN
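Slicing with df[0:3] selects rows by position, and the end position is excluded, as in ordinary Python slicing. The more explicit way to write the same thing is .iloc. A minimal sketch on a made-up data frame:

```python
import pandas as pd

df_small = pd.DataFrame({"x": [10, 20, 30, 40, 50]})

first_three = df_small[0:3]     # rows at positions 0, 1, 2
same_rows = df_small.iloc[0:3]  # explicit position-based equivalent
last_two = df_small.iloc[-2:]   # negative positions count from the end
print(first_three)
print(last_two)
```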


#### Selecting by label

In [14]:
df.loc[:, ["day", "site_id"]]

       day  site_id
0       16        2
1       16        3
2       16        2
3       16        7
4       16        3
...    ...      ...
35544   31       15
35545   31       15
35546   31       10
35547   31        7
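.loc can also restrict the rows, either with a label slice or with a condition. The sketch below uses a small made-up data frame; one point worth noticing is that label slices in .loc include both endpoints, unlike positional slices.

```python
import pandas as pd

# Hypothetical data with the default integer index 0..3
df_small = pd.DataFrame(
    {
        "day": [16, 16, 31, 31],
        "site_id": [2, 3, 15, 7],
        "weight": [None, 40.0, 14.0, 51.0],
    }
)

# Label slice: rows labelled 0, 1 and 2 (the end label is included)
rows_0_to_2 = df_small.loc[0:2, ["day", "site_id"]]

# Boolean mask: keep only the rows where the condition is True
late_rows = df_small.loc[df_small["day"] == 31, "site_id"]
print(rows_0_to_2)
print(late_rows)
```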


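#### Computing descriptive statistics:

In [15]:
# numeric_only=True skips text columns such as species_id and sex;
# without it, pandas 2.0 and later raise a TypeError on this data
df.mean(numeric_only=True)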

record_id          17775.000000
month                  6.474022
day                   16.105966
year                1990.475231
site_id               11.397001
hindfoot_length       29.287932
weight                42.672428
dtype: float64

In [16]:
df.mean(axis=1, numeric_only=True)  # calculate the mean of each row

0         339.166667
1         339.666667
2         340.333333
3         341.166667
4         340.500000
            ...     
35544    7521.000000
35545    7521.200000
35546    5375.857143
35547    5383.857143
35548    7519.800000
Length: 35549, dtype: float64

In [17]:
df.month.mean()

6.474021772764353
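The means above skip missing values by default, which is why weight still gets a mean despite its many NaN entries. A minimal sketch of the column, row and strict variants, on a made-up data frame:

```python
import pandas as pd

# Hypothetical data with one missing value in column "a"
df_small = pd.DataFrame({"a": [1.0, 2.0, None], "b": [10.0, 20.0, 30.0]})

col_means = df_small.mean()        # one value per column; NaN is skipped
row_means = df_small.mean(axis=1)  # one value per row; NaN is skipped
strict_mean = df_small["a"].mean(skipna=False)  # NaN if any value is missing
print(col_means)
print(row_means)
print(strict_mean)
```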