# Data science libraries - Chapter 3 - Lesson 2: Introduction to Pandas


Pandas is a Python library that makes it easy to manipulate, analyse, clean and explore data.
The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". 

## 1. The Series

A Pandas `Series` is a labelled, one-dimensional array capable of holding any type of data (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively called the `index`. Here are some examples of how to create a Series:

In [18]:
import numpy as np
import pandas as pd

# Default index
#data= [1, 7, 2]
#a = pd.Series(data)
#print(a)

#Labeled index
#b = pd.Series(np.arange(0, 13, 3), index=["a", "b", "c", "d", "e"])
#print(b)

# Create from dict
dico = {"b": 1, "a": 0, "c": 2}
c = pd.Series(dico)
print(c)

# Create from dict with float64 values
dico = {"a": 0.0, "b": 1.0, "c": 2.0}
d = pd.Series(dico)
print(d)

# Create from dict with index
e = pd.Series(dico, index=["b", "c", "d", "a"])
print(e)

# Create with fixed value
f = pd.Series(5.0, index=["a", "b", "c", "d", "e"])
print(f)


b    1
a    0
c    2
dtype: int64
a    0.0
b    1.0
c    2.0
dtype: float64
b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64
a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64
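
The `NaN` in the third output comes from label matching: when both a dict and an `index` are passed, values are looked up by label, and a label with no matching key is left missing. The following is a minimal sketch of that behaviour; the variable names `counts` and `reindexed` are illustrative and not part of the lesson.

```python
import pandas as pd

# Integer data keyed by label
counts = pd.Series({"b": 1, "a": 0, "c": 2})
print(counts.dtype)

# An explicit index selects and reorders the dict entries by label.
# "d" is not a key of the dict, so its value is missing (NaN).
reindexed = pd.Series({"b": 1, "a": 0, "c": 2}, index=["b", "c", "d", "a"])
print(reindexed)

# NaN is a float, so the integer values are stored as float64
print(reindexed.dtype)

# isna() marks the missing entries
print(reindexed.isna())
```

This is also why the integer dict from the first example would print as floats once a missing label is introduced.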


A Series behaves much like an `ndarray` and is a valid argument for most NumPy functions. However, operations such as slicing also slice the index, so each value keeps its label.

In [20]:
s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])

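# Select by position with .iloc: on a labelled index, s[0] is ambiguous
# and recent pandas versions warn about it
print(s.iloc[0])

print(s[:3])

print(s[s > s.median()])

print(s.iloc[[4, 3, 1]])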

print(np.exp(s))



-0.3197918065462681
a   -0.319792
b   -0.885572
c   -0.744806
dtype: float64
d    1.761166
e    0.063729
dtype: float64
e    0.063729
d    1.761166
b   -0.885572
dtype: float64
a    0.726300
b    0.412478
c    0.474827
d    5.819219
e    1.065804
dtype: float64
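
Because the outputs above come from random data, it can help to repeat the idea with fixed values. This is a small sketch, assuming nothing beyond NumPy and pandas; the names `head`, `roots` and `values` are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0, 16.0, 25.0], index=["a", "b", "c", "d", "e"])

# Slicing by position returns a Series whose index is sliced as well
head = s.iloc[:3]
print(head.index.tolist())

# A NumPy function applied to a Series returns a Series with the same labels
roots = np.sqrt(s)
print(roots)

# When the labels are not needed, to_numpy() gives the underlying ndarray
values = s.to_numpy()
print(type(values))
```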


A Series is also like a fixed-size `dict`, in that you can get and set values by index label:

In [21]:
print(s["a"])

s["e"] = 12.0
print(s)

print("e" in s)

print("f" in s)

-0.3197918065462681
a    -0.319792
b    -0.885572
c    -0.744806
d     1.761166
e    12.000000
dtype: float64
True
False
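
As with a `dict`, asking for a label that does not exist raises a `KeyError`, and `get()` offers a safe alternative. A minimal sketch with fixed values (the Series here is made up for illustration):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])

# Looking up a missing label with [] raises KeyError
try:
    s["f"]
except KeyError:
    print("label 'f' is not in the index")

# get() returns None for a missing label, or the default you pass
missing = s.get("f")
fallback = s.get("f", 0.0)
print(missing, fallback)
```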


A Series can also have a name:

In [22]:
# Create a Series with a name
s = pd.Series(np.random.randn(5), name="something")
print(s)
print(s.name)

# Create a new, renamed Series (the original keeps its name)
s2 = s.rename("different")
print(s2.name)


0    0.649858
1   -1.156265
2    0.295671
3    1.312554
4   -2.965602
Name: something, dtype: float64
something
different
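
One place where the name matters is when a Series is turned into a table: the name becomes the column label. The sketch below uses fixed values and an illustrative variable `frame`.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], name="something")

# to_frame() builds a one-column DataFrame; the Series name is the column label
frame = s.to_frame()
print(frame.columns.tolist())

# rename() returns a new Series and leaves the original untouched
renamed = s.rename("different")
print(s.name, renamed.name)
```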


## 2. Dataframes
A `DataFrame` is a two-dimensional labelled data structure with columns of potentially different types. You can think of it as a spreadsheet, an SQL table, or a `dict` of `Series` objects. It is the most commonly used pandas object. Like a `Series`, a `DataFrame` accepts many types of input:

 - Dict of 1D ndarrays, lists, dicts or series
 - 2D numpy.ndarray
 - structured ndarray
 - A series
 - Another DataFrame

In addition to the data, you can optionally pass the `index` (row labels) and `columns` (column labels) arguments.

In [24]:
d = {
    "one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
    "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"]),
}
df = pd.DataFrame(d)
print(df)

   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0


In [25]:
# Set indexes
df = pd.DataFrame(d, index=["d", "b", "a"])
print(df)

   one  two
d  NaN  4.0
b  2.0  2.0
a  1.0  1.0
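
The cells above start from a dict of Series. As a rough illustration of some of the other input types in the list, here is a sketch that builds frames from a 2D array, from a single Series, and from a dict of lists with a `columns` argument. All variable names and values are made up for the example.

```python
import numpy as np
import pandas as pd

# From a 2D ndarray: row and column labels are supplied separately
arr = np.arange(6).reshape(3, 2)
df_arr = pd.DataFrame(arr, index=["a", "b", "c"], columns=["one", "two"])
print(df_arr)

# From a single Series: the Series name becomes the column label
ser = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"], name="one")
df_ser = pd.DataFrame(ser)
print(df_ser)

# From a dict of lists: `columns` selects keys by name, and a name that is
# not a key of the dict ("three") produces a column filled with NaN
d = {"one": [1.0, 2.0], "two": [3.0, 4.0]}
df_cols = pd.DataFrame(d, columns=["two", "three"])
print(df_cols)
```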


In [None]:
# Set columns
#df = pd.DataFrame(d, index=["d", "b", "a"], columns=["two", "three"])
#print(df)

In [None]:
# Show indexes and columns
#print(df.index)
#print(df.columns)

In [None]:
# --- Build from dict of lists ---
#d = {"one": [1.0, 2.0, 3.0, 4.0], "two": [4.0, 3.0, 2.0, 1.0]}
#df = pd.DataFrame(d)
#print(df)

In [None]:
# Set indexes
#df = pd.DataFrame(d, index=["a", "b", "c", "d"])
#print(df)

In [23]:
# --- From csv ---
# "\file_path\data.csv" is a placeholder: replace it with the path to your own file.
# The r prefix (raw string) stops Python from reading "\f" as an escape sequence.
#df = pd.read_csv(r"\file_path\data.csv")
#print(df)
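
Since the CSV path in the lesson is only a placeholder, the sketch below reads CSV text from memory instead, so it runs without any file on disk. The column names and values are invented for the example; `io.StringIO` simply stands in for a file.

```python
import io

import pandas as pd

# CSV content held in a string; with a real file you would pass its path instead
csv_text = "name,one,two\na,1.0,4.0\nb,2.0,3.0\nc,3.0,2.0\n"

# index_col tells read_csv which column to use as row labels
df_csv = pd.read_csv(io.StringIO(csv_text), index_col="name")
print(df_csv)
print(df_csv.dtypes)
```

Without `index_col`, pandas would keep `name` as an ordinary column and number the rows from 0.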


You can treat a `DataFrame` like a `dict` of `Series` objects that share the same index. Getting, setting and deleting columns works with the same syntax as the analogous `dict` operations:

In [7]:
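# Rebuild df from a dict of lists so that it matches the output shown below
df = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0, 4.0], "two": [4.0, 3.0, 2.0, 1.0]},
    index=["a", "b", "c", "d"],
)

# Show column "one"
print(df["one"])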

# Create column "three"
df["three"] = df["one"] * df["two"]
print(df)

# Create column "flag"
df["flag"] = df["one"] > 2
print(df)

# Delete column "flag"
del df["flag"]
print(df)

# Get and delete column "three"
three = df.pop("three")
print(df)

# Set fixed value
df["foo"] = "bar"
print(df)

# Create column "one_trunc"
df["one_trunc"] = df["one"][:2]
print(df)

# Insert column "bar"
df.insert(1, "bar", df["one"])
print(df)


a    1.0
b    2.0
c    3.0
d    4.0
Name: one, dtype: float64
   one  two  three
a  1.0  4.0    4.0
b  2.0  3.0    6.0
c  3.0  2.0    6.0
d  4.0  1.0    4.0
   one  two  three   flag
a  1.0  4.0    4.0  False
b  2.0  3.0    6.0  False
c  3.0  2.0    6.0   True
d  4.0  1.0    4.0   True
   one  two  three
a  1.0  4.0    4.0
b  2.0  3.0    6.0
c  3.0  2.0    6.0
d  4.0  1.0    4.0
   one  two
a  1.0  4.0
b  2.0  3.0
c  3.0  2.0
d  4.0  1.0
   one  two  foo
a  1.0  4.0  bar
b  2.0  3.0  bar
c  3.0  2.0  bar
d  4.0  1.0  bar
   one  two  foo  one_trunc
a  1.0  4.0  bar        1.0
b  2.0  3.0  bar        2.0
c  3.0  2.0  bar        NaN
d  4.0  1.0  bar        NaN
   one  bar  two  foo  one_trunc
a  1.0  1.0  4.0  bar        1.0
b  2.0  2.0  3.0  bar        2.0
c  3.0  3.0  2.0  bar        NaN
d  4.0  4.0  1.0  bar        NaN


The `DataFrame` has an `assign()` method that allows you to easily create new columns potentially derived from existing columns:

In [10]:
dfa = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
print(dfa)
dfa2 = dfa.assign(C=dfa["A"] + dfa["B"])
print(dfa2)



   A  B
0  1  4
1  2  5
2  3  6
   A  B  C
0  1  4  5
1  2  5  7
2  3  6  9
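
`assign()` also accepts a function for each new column, which is convenient when chaining operations. The sketch below is one possible use, not the only way to write it; the column names `C` and `D` and the variable `dfa3` are illustrative.

```python
import pandas as pd

dfa = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# Each lambda receives the frame being built, so D can use the new column C
dfa3 = dfa.assign(
    C=lambda x: x["A"] + x["B"],
    D=lambda x: x["C"] * 2,
)
print(dfa3)

# assign() returns a copy: the original frame still has only A and B
print(dfa.columns.tolist())
```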


### Indexing and selection

The basics of indexing are as follows:

| Operation | Syntax |
| --------- | ------ |
| Select column | `df[col]` |
| Select row by integer location | `df.iloc[loc]` |
| Slice rows | `df[5:10]` |
| Select rows by boolean vector | `df[bool_vec]` |

In [11]:
df = pd.DataFrame(np.random.randn(10, 4), columns=["A", "B", "C", "D"])

print(df["B"])
print(df.iloc[2])
print(df[5:10])

0    0.118028
1   -1.330478
2   -0.215657
3   -0.100419
4    2.527934
5    0.395523
6    1.088923
7   -0.259393
8   -1.370819
9    1.099071
Name: B, dtype: float64
A    0.368528
B   -0.215657
C    0.544586
D   -1.117612
Name: 2, dtype: float64
          A         B         C         D
5 -0.649084  0.395523 -0.168184 -0.637145
6  0.895472  1.088923 -1.369791 -1.997235
7 -1.633526 -0.259393 -0.975429  1.605434
8  1.502611 -1.370819  0.419629 -0.171532
9  0.423988  1.099071 -1.091935  1.077987
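
The last row of the table, selection by boolean vector, can be sketched with a small fixed frame. The example also uses `df.loc[label]`, which selects a row by label; it is standard pandas but is not listed in the table above. The data and variable names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1.0, -2.0, 3.0, -4.0], "B": [10, 20, 30, 40]},
    index=["w", "x", "y", "z"],
)

# A comparison on a column produces a boolean Series with the same index
mask = df["A"] > 0
print(mask)

# Passing that boolean vector keeps only the rows where it is True
positive = df[mask]
print(positive)

# Row by label versus row by integer position
row_by_label = df.loc["y"]
row_by_position = df.iloc[0]
print(row_by_label["B"], row_by_position["A"])
```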


### Data alignment and arithmetic

Arithmetic between `DataFrame` objects automatically aligns on both the columns and the index (row labels). The resulting object has the union of the column and row labels, with `NaN` wherever a label is missing from one of the operands.


In [None]:
df = pd.DataFrame(np.random.randn(10, 4), columns=["A", "B", "C", "D"])
df2 = pd.DataFrame(np.random.randn(7, 3), columns=["A", "B", "C"])

print(df + df2)

print(df - df.iloc[0])

print(df * 5 + 2)

print(1 / df)

print(df ** 4)

# Transpose
print(df.T)

# Sort
print(df.sort_values(by="B"))

# --- Boolean operators ---

df1 = pd.DataFrame({"a": [1, 0, 1], "b": [0, 1, 1]}, dtype=bool)
df2 = pd.DataFrame({"a": [0, 1, 1], "b": [1, 1, 0]}, dtype=bool)

print(df1 & df2)

print(df1 | df2)

print(df1 ^ df2)

print(-df1)
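
The cell above uses random data and is shown without output, so here is a small deterministic sketch of the same alignment rules. The frames `left` and `right` are invented for the example; `add(..., fill_value=0)` is one way, among others, to avoid the `NaN` values.

```python
import pandas as pd

left = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=[0, 1, 2])
right = pd.DataFrame({"A": [10, 20]}, index=[0, 1])

# Labels are matched first: the result has the union of rows and columns,
# and NaN wherever one of the two frames has no value for a label
total = left + right
print(total)

# add() with fill_value treats a value missing on one side as 0
filled = left.add(right, fill_value=0)
print(filled)

# Subtracting a row (a Series) matches its labels against the column names,
# so the first row is subtracted from every row
shifted = left - left.iloc[0]
print(shifted)
```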


## 3. Display

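Very large `DataFrame` objects are truncated when displayed in the console. You can get a summary with `info()`, look at the first or last rows with `head()` and `tail()`, and compute per-column statistics with `describe()`.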

In [12]:
df = pd.DataFrame(np.random.randn(200, 4), columns=["A", "B", "C", "D"])

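# Printing a large frame shows only its first and last rows
print(df)

# Summary: index range, column dtypes, non-null counts, memory usage
df.info()

# First 5 rows (default), then first 10 rows
df.head()
df.head(10)

# Last 5 rows
df.tail()

# Descriptive statistics per column.
# In a notebook only the last expression of a cell is displayed, so the
# head() and tail() results are not shown here; wrap them in print() to see them.
df.describe()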

            A         B         C         D
0    0.098102 -0.376757 -0.229182  0.062517
1    1.676983  0.909373 -1.486101 -0.117991
2   -1.750117  1.112295  0.211920 -0.435388
3    0.275808 -0.852016 -0.555726  0.631950
4   -0.533483 -1.325208  1.610677 -1.790264
..        ...       ...       ...       ...
195 -0.501949 -1.770592 -0.767374  0.041434
196  0.264772 -0.965123 -0.324014  1.512960
197  0.767783  0.810559 -0.143663  0.123598
198  0.803980  0.805160 -0.420896 -0.511690
199  0.081760 -0.801646  0.282531 -0.211037

[200 rows x 4 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A       200 non-null    float64
 1   B       200 non-null    float64
 2   C       200 non-null    float64
 3   D       200 non-null    float64
dtypes: float64(4)
memory usage: 6.4 KB


                A           B           C           D
count  200.000000  200.000000  200.000000  200.000000
mean     0.050271    0.042125    0.021389    0.021181
std      0.962852    0.989074    1.025319    1.064046
min     -2.227199   -2.481196   -3.330833   -3.180062
25%     -0.614664   -0.605589   -0.576633   -0.704040
50%      0.102322    0.094986    0.009193    0.074263
75%      0.730806    0.694543    0.694235    0.717109
max      2.804095    3.391807    2.374742    3.066232
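
How much of a large frame is printed can be adjusted through the pandas display options. The sketch below is one way to do it, using `pd.option_context` to change `display.max_rows` temporarily and `to_string()` to get every row. The limit of 6 rows is an arbitrary choice for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(200, 4), columns=["A", "B", "C", "D"])

# Temporarily lower the row limit; the option is restored after the with block
with pd.option_context("display.max_rows", 6):
    short = repr(df)
print(short)

# to_string() renders every row, whatever the display limit is
full = df.to_string()
print(len(full.splitlines()))
```

The truncated version still ends with the frame dimensions, so the total number of rows stays visible.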
