# Data science libraries - Chapter 3 - Lesson 2: Introduction to Pandas


Pandas is a Python library that makes it easy to manipulate, analyse, clean and explore data.
The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". 

## 1. The Series

A Pandas Series is a labelled, one-dimensional array capable of holding any type of data (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively called the `index`. Here are some examples of how to create a Series:

In [None]:
import numpy as np
import pandas as pd

# Default index
data = [1, 7, 2]
a = pd.Series(data)
print(a)

# Labeled index
b = pd.Series(np.arange(0, 13, 3), index=["a", "b", "c", "d", "e"])
print(b)

# Create from dict
dico = {"b": 1, "a": 0, "c": 2}
c = pd.Series(dico)
print(c)

# Create from dict with float values (dtype float64)
dico = {"a": 0.0, "b": 1.0, "c": 2.0}
d = pd.Series(dico)
print(d)

# Create from dict with index
e = pd.Series(dico, index=["b", "c", "d", "a"])
print(e)

# Create with fixed value
f = pd.Series(5.0, index=["a", "b", "c", "d", "e"])
print(f)


A Series behaves much like an `ndarray` and is a valid argument for most NumPy functions. Unlike an `ndarray`, however, operations such as slicing also slice the index:

In [None]:
s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])

# Positional access goes through .iloc; s[0] on a label index is deprecated
print(s.iloc[0])

print(s[:3])

print(s[s > s.median()])

print(s.iloc[[4, 3, 1]])

print(np.exp(s))
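
Because the index travels with the values, arithmetic between two Series aligns on labels rather than on positions. The following is a minimal sketch of that behaviour; the name `s_demo` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Small Series with known values so the results are easy to follow
s_demo = pd.Series([10.0, 20.0, 30.0, 40.0], index=["a", "b", "c", "d"])

# Slicing returns a Series whose index is sliced along with the values
head = s_demo.iloc[:2]
print(head.index.tolist())

# Arithmetic aligns on labels, not positions: a label that is present
# in only one operand gives NaN in the result
total = s_demo.iloc[1:] + s_demo.iloc[:-1]
print(total)
```

Here `"a"` and `"d"` each appear in only one of the two slices, so their entries in `total` are `NaN`, while `"b"` and `"c"` are doubled.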



A Series is also like a fixed-size `dict` in that you can get and set values by index label:

In [None]:
print(s["a"])

s["e"] = 12.0
print(s)

print("e" in s)

print("f" in s)
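
The `dict` analogy also covers missing keys. As a small sketch (the Series `s_demo` is an illustrative assumption, not part of the lesson data): square brackets raise `KeyError` for an unknown label, while `get()` returns a default value instead.

```python
import pandas as pd

s_demo = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])

# A missing label raises KeyError with square brackets...
try:
    s_demo["f"]
    found = True
except KeyError:
    found = False
print(found)

# ...while get() returns a default instead (None unless one is given)
missing = s_demo.get("f")
fallback = s_demo.get("f", 0.0)
print(missing, fallback)
```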

A Series can also have a name:

In [None]:
# Create a Series with a name
s = pd.Series(np.random.randn(5), name="something")
print(s)
print(s.name)

# Create a new, renamed Series
s2 = s.rename("different")
print(s2.name)
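
One place where the name matters is when a Series is turned into a table: the name becomes the column label. The sketch below uses illustrative variable names (`s_named`, `renamed`, `frame`) and also shows that `rename()` leaves the original Series untouched.

```python
import pandas as pd

s_named = pd.Series([1, 2, 3], name="something")

# rename() returns a new Series; the original keeps its name
renamed = s_named.rename("different")
print(s_named.name, renamed.name)

# to_frame() builds a one-column DataFrame labelled with the Series name
frame = s_named.to_frame()
print(frame.columns.tolist())
```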


## 2. DataFrames
A `DataFrame` is a two-dimensional labelled data structure with columns of potentially different types. You can think of it as a spreadsheet, an SQL table, or a `dict` of `Series` objects. It is the most commonly used pandas object. Like a `Series`, a `DataFrame` accepts many types of input:

 - Dict of 1D ndarrays, lists, dicts or Series
 - 2D numpy.ndarray
 - Structured ndarray
 - A Series
 - Another DataFrame

In addition to the data, you can optionally pass `index` (row labels) and `columns` (column labels) arguments.

In [None]:
# --- Build from dict of series ---
d = {
    "one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
    "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"]),
}
df = pd.DataFrame(d)
print(df)

# Set indexes
df = pd.DataFrame(d, index=["d", "b", "a"])
print(df)

# Set columns ("three" is not in the data, so that column is filled with NaN)
df = pd.DataFrame(d, index=["d", "b", "a"], columns=["two", "three"])
print(df)

# Show indexes and columns
print(df.index)
print(df.columns)


# --- Build from dict of lists ---
d = {"one": [1.0, 2.0, 3.0, 4.0], "two": [4.0, 3.0, 2.0, 1.0]}
df = pd.DataFrame(d)
print(df)

# Set indexes
df = pd.DataFrame(d, index=["a", "b", "c", "d"])
print(df)


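# --- From csv ---
# Raw string so the backslashes are not read as escape sequences ("\f" is a form feed).
# The path is a placeholder; a separate name keeps `df` available for the next cells.
df_csv = pd.read_csv(r"\file_path\data.csv")
print(df_csv)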
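
The other input types listed earlier work in much the same way. Here is a brief sketch for a 2D ndarray, a single Series and a dict of dicts; all variable names and values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# From a 2D ndarray: row and column labels are passed explicitly
arr = np.arange(6).reshape(3, 2)
df_arr = pd.DataFrame(arr, index=["a", "b", "c"], columns=["one", "two"])
print(df_arr)

# From a named Series: the Series name becomes the column label
df_ser = pd.DataFrame(pd.Series([1.0, 2.0], index=["a", "b"], name="one"))
print(df_ser)

# From a dict of dicts: outer keys are columns, inner keys are row labels
df_dd = pd.DataFrame({"one": {"a": 1, "b": 2}, "two": {"a": 3, "b": 4}})
print(df_dd)
```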

You can treat a `DataFrame` like a `dict` of `Series` objects that share the same index. Getting, setting and deleting columns works with the same syntax as the analogous `dict` operations:

In [None]:
# Show column "one"
print(df["one"])

# Create column "three"
df["three"] = df["one"] * df["two"]
print(df)

# Create column "flag"
df["flag"] = df["one"] > 2
print(df)

# Delete column "flag"
del df["flag"]
print(df)

# Get and delete column "three"
three = df.pop("three")
print(df)

# Set fixed value
df["foo"] = "bar"
print(df)

# Create column "one_trunc"
df["one_trunc"] = df["one"][:2]
print(df)

# Insert column "bar"
df.insert(1, "bar", df["one"])
print(df)


The `DataFrame` has an `assign()` method that allows you to easily create new columns potentially derived from existing columns:

In [None]:
dfa = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
dfa2 = dfa.assign(C=dfa["A"] + dfa["B"])
print(dfa2)

dfa3 = dfa.assign(C=lambda x: x["A"] + x["B"], D=lambda x: x["A"] + x["C"])
print(dfa3)
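
Note that `assign()` returns a new `DataFrame` and does not modify the one it is called on. A short sketch to check this, reusing the same small frame as above (the name `result` is illustrative):

```python
import pandas as pd

dfa = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# assign() builds and returns a copy with the extra column
result = dfa.assign(C=lambda x: x["A"] * x["B"])

print(dfa.columns.tolist())     # the original still has only A and B
print(result.columns.tolist())  # the copy also has C
```

This makes `assign()` convenient in method chains, where each step produces a new object instead of changing an existing one.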

### Indexing and selection

The basics of indexing are as follows:

| Operation | Syntax |
| --------- | ------ | 
| Select column | `df[col]` |
| Select row by integer location | `df.iloc[loc]` | 
| Slice rows | `df[5:10]` | 
| Select rows by boolean vector | `df[bool_vec]` | 

In [None]:
df = pd.DataFrame(np.random.randn(10, 4), columns=["A", "B", "C", "D"])

print(df["B"])
print(df.iloc[2])
print(df[5:10])
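
Selecting rows with a boolean vector can be sketched on a small frame with fixed values, so the outcome is easy to check. The frame `df_demo` is an illustrative assumption; `.loc`, the label-based counterpart of `.iloc`, is included for comparison.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"A": [1, -2, 3, -4], "B": [10, 20, 30, 40]},
    index=["w", "x", "y", "z"],
)

# A comparison on a column yields a boolean Series (the "boolean vector")
mask = df_demo["A"] > 0
print(mask)

# Indexing with the mask keeps only the rows where it is True
positive = df_demo[mask]
print(positive)

# .loc selects a row by label, where .iloc selects it by position
row_x = df_demo.loc["x"]
print(row_x)
```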

### Data alignment and arithmetic

Arithmetic between `DataFrame` objects automatically aligns on both the columns and the index (row labels). The resulting object has the union of the column and row labels, with `NaN` wherever a label is missing from one of the operands.


In [None]:
df = pd.DataFrame(np.random.randn(10, 4), columns=["A", "B", "C", "D"])
df2 = pd.DataFrame(np.random.randn(7, 3), columns=["A", "B", "C"])

print(df + df2)

print(df - df.iloc[0])

print(df * 5 + 2)

print(1 / df)

print(df ** 4)

# Transpose
print(df.T)

# Sort
print(df.sort_values(by="B"))

# --- Boolean operators ---

df1 = pd.DataFrame({"a": [1, 0, 1], "b": [0, 1, 1]}, dtype=bool)
df2 = pd.DataFrame({"a": [0, 1, 1], "b": [1, 1, 0]}, dtype=bool)

print(df1 & df2)

print(df1 | df2)

print(df1 ^ df2)

print(~df1)  # element-wise NOT; ~ is the idiomatic operator for boolean data
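
The alignment rule is easier to see on two tiny frames whose labels only partly overlap. The sketch below uses made-up frames `left` and `right`; it also shows `DataFrame.add()` with `fill_value`, which treats a value missing from one operand as the given fill value instead of producing `NaN`.

```python
import pandas as pd

left = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=[0, 1])
right = pd.DataFrame({"A": [10, 20], "C": [30, 40]}, index=[1, 2])

# The result covers the union of row labels (0, 1, 2) and columns (A, B, C).
# Only the cell (1, "A") exists in both operands, so it is the only non-NaN value.
summed = left + right
print(summed)

# With fill_value, a cell missing from one operand is treated as 0;
# cells missing from both operands stay NaN
filled = left.add(right, fill_value=0)
print(filled)
```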


## 3. Display

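Very large `DataFrames` will be truncated for display in the console. You can get a summary using `info()`, look at the first or last rows with `head()` and `tail()` (five rows by default), and compute summary statistics of the numeric columns with `describe()`.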

In [None]:
df = pd.DataFrame(np.random.randn(200, 4), columns=["A", "B", "C", "D"])

print(df)

df.info()

# Outside a notebook, only printed results are shown
print(df.head())
print(df.head(10))

print(df.tail())

print(df.describe())
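
How much of a frame is shown before truncation is controlled by display options. As a hedged sketch (the frame `df_big` and the limit of 6 rows are arbitrary choices for illustration), `pd.option_context` changes an option temporarily, and `to_string()` renders every row regardless of the options:

```python
import numpy as np
import pandas as pd

df_big = pd.DataFrame(np.random.randn(200, 4), columns=["A", "B", "C", "D"])

# Temporarily lower the row limit; the previous setting is restored on exit
with pd.option_context("display.max_rows", 6):
    short_repr = repr(df_big)

# to_string() always renders the full frame (header line plus one line per row)
full_text = df_big.to_string()

print(len(short_repr.splitlines()), len(full_text.splitlines()))
```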