In [1]:
import numpy as np
import pandas as pd 

# Overview of pandas

**Pandas is suitable for handling the following types of data:**

* Tabular data with heterogeneous columns, similar to SQL or Excel tables.

* Time series data.

* Matrix data with labeled rows and columns.

The main data structures in pandas are `Series` (one-dimensional data) and `DataFrame` (two-dimensional data). Together they cover the majority of typical use cases in fields such as finance, statistics, social sciences, and engineering. Pandas is built on `NumPy` and integrates with other third-party scientific computing libraries.

## Data Structure

Pandas has two core data structures: **Series** and **DataFrame**.

| Dimension | Name | Description | Analogy |
| :--: | :---: | :-------: | :---: |
| 1 | Series | A labeled one-dimensional homogeneous array | A single column or row of data |
| 2 | DataFrame | A labeled, size-mutable, two-dimensional heterogeneous table | A sheet in Excel or a table in a database |


- **Series**: Each element in the Series is associated with a label or index, allowing for easy and efficient access to the data. The Series can hold data of any type, such as numbers, strings, or dates.

- **DataFrame**: The DataFrame allows for flexible data manipulation and analysis, providing a tabular structure with labeled rows and columns. It is commonly used for storing and working with structured data, supporting operations such as indexing, filtering, merging, and statistical computations.
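
The relationship between the two structures can be sketched as follows. The names `ages` and `people` and the sample values are made up for illustration; the point is only that a `DataFrame` column, once selected, is itself a `Series`.

```python
import pandas as pd

# A Series: one labeled column of values with a single dtype.
ages = pd.Series([18, 30, 25], index=["Tom", "Bob", "Mary"], name="age")

# A DataFrame: several labeled columns, each of which may have its own dtype.
people = pd.DataFrame(
    {"age": [18, 30, 25], "city": ["Paris", "Oslo", "Lima"]},
    index=["Tom", "Bob", "Mary"],
)

print(type(people["age"]))  # selecting a single column yields a Series
print(people.dtypes)        # one dtype per column
```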


## Check the version of Pandas

In [2]:
pd.__version__

'2.0.2'

# Data Type
## Main data type

- float
- int
- bool
- datetime64[ns]
- datetime64[ns, tz]
- timedelta64[ns]
- category
- object
- string 

Here is a mapping of the default data types in Pandas to their corresponding types in Python and NumPy:

| Pandas         | Python       | NumPy                                                          | 
| -------------- | ------------ | ------------------------------------------------------------   | 
| object         | str or mixed | string_, unicode_, mixed types                                 | 
| int64          | int          | int, int8, int16, int32, int64, uint8, uint16, uint32, uint64 | 
| float64        | float        | float, float16, float32, float64                              | 
| bool           | bool         | bool                                                          | 
| datetime64[ns] | datetime.datetime | datetime64[ns]                                              | 
| timedelta64[ns] | datetime.timedelta | timedelta64[ns]                                           | 
| category       | no direct equivalent | no direct equivalent                                     | 
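
As a quick check of the table above, the following sketch builds one small Series per dtype family and prints the dtype pandas infers. The dictionary name `dtype_examples` and the sample values are illustrative assumptions, and the exact datetime resolution printed may differ between pandas versions.

```python
import pandas as pd

# One Series per dtype family; pandas infers the dtype from the values
# unless a dtype is passed explicitly (as for "category").
dtype_examples = {
    "int": pd.Series([1, 2, 3]),
    "float": pd.Series([1.0, 2.5]),
    "bool": pd.Series([True, False]),
    "datetime": pd.Series(pd.to_datetime(["2023-01-01", "2023-01-02"])),
    "timedelta": pd.Series(pd.to_timedelta(["1 days", "2 days"])),
    "category": pd.Series(["a", "b", "a"], dtype="category"),
    "object": pd.Series([1, "a", 2.5]),  # mixed values fall back to object
}

for label, ser in dtype_examples.items():
    print(f"{label:10s} -> {ser.dtype}")
```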


# Series 

A Series is a one-dimensional labeled array with an optional name and an index. Like a NumPy array, a Series has a single dtype; when values of different types are mixed (integers, floats, strings, arbitrary Python objects), pandas stores them under the `object` dtype, so in that respect a Series can behave like a Python list. Unlike a list, which is accessed only by integer position, a Series also allows values to be accessed through custom index labels, which makes it similar to a dictionary.

## Generate series
```python
pandas.Series(data=None, index = None, dtype = None, name = None, copy = False)
```
**The explanation of the parameters:**

**data**: Supports several input types:

- Python dictionary
- List or one-dimensional array (e.g., an `ndarray`)
- Scalar value (e.g., 5)

**index**: The index, similar to an array or list. The values must be hashable and have the same length as "data". Non-unique index values are allowed. By default, it is a RangeIndex (0, 1, 2, ..., n-1) if not provided. If both "index" and a dictionary are provided, the values are selected by the labels in "index"; labels that are missing from the dictionary produce NaN.

**dtype**: str, numpy.dtype, or ExtensionType, optional. The desired data type for the output Series. If not specified, it will be inferred from the "data".

**name**: The name of the Series.

**copy**: bool, default False. Specifies whether to make a copy of the input data or not.
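
The parameters described above can be tried out with a short sketch. The variable names (`prices`, `scores`, `source`, `detached`) and values are illustrative only.

```python
import numpy as np
import pandas as pd

# index + dictionary: values are looked up by label.
# "z" is not a key of the dictionary, so its value becomes NaN.
prices = pd.Series({"a": 1, "b": 2, "c": 3}, index=["b", "c", "z"])
print(prices)

# dtype and name set explicitly instead of being inferred.
scores = pd.Series([1, 2, 3], dtype="float32", name="scores")
print(scores.dtype, scores.name)

# copy=True detaches the Series from the source array,
# so later changes to the array do not show up in the Series.
source = np.array([1.0, 2.0, 3.0])
detached = pd.Series(source, copy=True)
source[0] = 99.0
print(detached.iloc[0])  # 1.0
```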

### Generate a series using a list or an array

In [3]:
# Define a Series object
s = pd.Series(np.random.rand(5))
print("Series object: \n", s)
print("Type of Series object:", type(s))

Series object: 
 0    0.638583
1    0.576698
2    0.180556
3    0.977612
4    0.454322
dtype: float64
Type of Series object: <class 'pandas.core.series.Series'>


In [4]:
# Check the Series object index and values
print("Series index:", s.index, "\n type:", type(s.index))
print("Series values:", s.values, "\n type:", type(s.values))  # type is ndarray

Series index: RangeIndex(start=0, stop=5, step=1) 
 type: <class 'pandas.core.indexes.range.RangeIndex'>
Series values: [0.63858306 0.57669815 0.1805559  0.97761203 0.45432195] 
 type: <class 'numpy.ndarray'>


When a Series is generated from an `ndarray`, pandas automatically generates an integer index by default.

In [5]:
arr = np.random.rand(5)
# Generate series object through ndarray
s1 = pd.Series(arr)  # Default index is 0, step is 1
s1

0    0.153039
1    0.471209
2    0.757999
3    0.293920
4    0.517502
dtype: float64

In [6]:
# Generate a Series object from an ndarray and set the index with the list ['a', 'b', 'c', 'd', 'e']
s2 = pd.Series(arr, index = ['a', 'b', 'c', 'd', 'e'], dtype = np.float32) # dtype sets the data type
s2

a    0.153039
b    0.471209
c    0.757999
d    0.293920
e    0.517502
dtype: float32

In [7]:
s2.index.name = "Index"  # Add a name to index
s2.name = "Random Numbers" # Add a name to series
s2

Index
a    0.153039
b    0.471209
c    0.757999
d    0.293920
e    0.517502
Name: Random Numbers, dtype: float32

### Generate a series using dictionary

The keys of the dictionary become the index, and the values of the dictionary become the values of the Series.

In [8]:
dic = {"a": 1, "b":2, "c":3, "d":4, "e": 5}
s3 = pd.Series(dic)
s3

a    1
b    2
c    3
d    4
e    5
dtype: int64

In [9]:
# The keys here are strings; what happens if the values have mixed types (strings, floats, integers)?
dic1 = {"a": 1, "b": "Credit Risk Reporting", "c": 3.5, "d":4, "e": 5}
s4 = pd.Series(dic1)
s4    # the data type is inferred as object

a                        1
b    Credit Risk Reporting
c                      3.5
d                        4
e                        5
dtype: object

### Generate a series using a scalar value

In [10]:
pd.Series(888, index = np.arange(5))

0    888
1    888
2    888
3    888
4    888
dtype: int64

## Properties of series
Common attributes of a Series include:
- shape: Series shape
- index: Series index
- values: Series values, ndarray type
- name: Series name
- dtype: Series data type
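
A minimal sketch of reading these attributes, using a small made-up Series named `ages`:

```python
import pandas as pd

ages = pd.Series([18, 30, 25], index=["Tom", "Bob", "Mary"], name="age")

print(ages.shape)   # (3,)
print(ages.index)   # the Index object holding the labels
print(ages.values)  # the underlying ndarray
print(ages.name)    # age
print(ages.dtype)   # int64
```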

## Methods of series

### Basic methods
* head(n): View the first n elements of the Series. By default, n is 5.
* tail(n): View the last n elements of the Series. By default, n is 5.
* unique(): Return the unique elements in the Series.
* nunique(dropna=True): Return the number of unique elements in the Series. If dropna is False, missing values are counted as well.
* isna(): Check if the elements are missing values.
* isnull(): Same as isna().
* dropna(inplace=False): Remove missing values from the Series.
* isin(values): Check for membership, where values can be a set or list-like object. It checks if each element in the Series is a member of values.
* sort_index(ascending=True, inplace=False): Sort by the index, in ascending order by default; pass ascending=False for descending order.
* sort_values(ascending=True, inplace=False): Sort by the values, in ascending order by default; pass ascending=False for descending order.
* idxmax(): Return the index of the maximum value.
* idxmin(): Return the index of the minimum value.
* nlargest(n): Return the n largest values.
* nsmallest(n): Return the n smallest values.
* count(): Return the number of non-missing values.
* value_counts(ascending=False, dropna=True): Count the frequency of each value in the Series.
* clip(lower, upper, inplace=False): Truncate the Series values to be within the lower-upper range. Values below lower are replaced with lower, and values above upper are replaced with upper.
* replace(to_replace, value=None, inplace=False): Replace specified values in the Series.
* where(cond, other=np.nan, inplace=False): Replace elements in the Series that do not satisfy the condition with other.
* add_prefix(prefix): Add a prefix to the index labels of the Series.
* add_suffix(suffix): Add a suffix to the index labels of the Series.
* reset_index(drop=False, name=None, inplace=False): Reset the index. When drop is False, it returns a DataFrame.
* to_frame(name=None): Convert the Series to a DataFrame.
* append(to_append, ignore_index=False, verify_integrity=False): Concatenate two Series together. This method was removed in pandas 2.0; use pd.concat([s1, s2]) instead.
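
A few of the methods listed above are sketched below on a small made-up Series named `sample`. Concatenation uses `pd.concat`, because `Series.append` is no longer available in pandas 2.0 and later.

```python
import numpy as np
import pandas as pd

sample = pd.Series([3, 1, 3, np.nan, 7], index=list("abcde"))

print(sample.nunique())              # 3: distinct non-missing values
print(sample.nunique(dropna=False))  # 4: NaN counted as a value
print(sample.count())                # 4: non-missing values
print(sample.idxmax())               # e: label of the maximum value
print(sample.value_counts())         # frequency of each value, NaN excluded

# Values below 2 become 2, values above 5 become 5; NaN stays NaN.
clipped = sample.clip(lower=2, upper=5)

# Largest value first; missing values are placed last.
descending = sample.sort_values(ascending=False)

# Concatenate two Series into a new one.
combined = pd.concat([sample, pd.Series([9.0], index=["f"])])
print(combined)
```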


## Adding, deleting, modifying, and querying Series elements

### Access data elements

There are three ways to access data elements in a Series:

* Label indexing: Accessing elements by their index labels.
* Positional indexing: Accessing elements by their integer position.
* Boolean indexing: Accessing elements based on a Boolean condition.

In [11]:
name = pd.Index(["Tom", "Bob", "Mary", "James"], name = "name")
user_age = pd.Series(data = [18, 30, 25, 40], index = name, name = "user_age_info")
user_age

name
Tom      18
Bob      30
Mary     25
James    40
Name: user_age_info, dtype: int64

### Access by index
Access a specific element by its index label.

In [12]:
# Get Tom's age
user_age.Tom  # Attribute-style access

18

In [13]:
user_age['Tom'] # Dictionary-style access

18

In [14]:
user_age.get('Tom')

18

In [15]:
# user_age['Kobe']  # Raises KeyError
user_age.get('Kobe')  # get returns None instead of raising an error

In [16]:
user_age.get('Kobe', default=30)  # Setting a default value for a non-existent key.

30

Access multiple elements using an array of indexes.

In [17]:
user_age[['Tom', 'James']]

name
Tom      18
James    40
Name: user_age_info, dtype: int64

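### Accessing by position

Access a specific element by position. With a non-integer index such as this one, an integer key falls back to positional access; this fallback is deprecated since pandas 2.1, where `.iloc` is the explicit positional accessor.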

In [18]:
user_age[0]  # The first element

18

Access multiple elements using an array of positions.

In [19]:
user_age[[1,3]]  #  Retrieving data elements with positions 1 and 3.

name
Bob      30
James    40
Name: user_age_info, dtype: int64

Access multiple elements using positional slicing.

In [20]:
user_age[:3]  # Retrieving the first three elements, i.e., the elements with positions 0, 1, and 2.

name
Tom     18
Bob     30
Mary    25
Name: user_age_info, dtype: int64

### Boolean indexing

In [21]:
user_age[user_age > 30]  # Finding all elements with an age greater than 30.

name
James    40
Name: user_age_info, dtype: int64
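
The same three access styles can also be written with the explicit `.loc` (label-based) and `.iloc` (position-based) accessors, which avoid any ambiguity between labels and positions. The sketch below rebuilds the data under the hypothetical name `age_by_name` so that it runs on its own.

```python
import pandas as pd

age_by_name = pd.Series(
    [18, 30, 25, 40],
    index=pd.Index(["Tom", "Bob", "Mary", "James"], name="name"),
    name="user_age_info",
)

# Label-based access
print(age_by_name.loc["Tom"])             # 18
print(age_by_name.loc[["Tom", "James"]])  # several labels at once
print(age_by_name.loc["Bob":"James"])     # label slices include both endpoints

# Position-based access
print(age_by_name.iloc[0])       # 18
print(age_by_name.iloc[[1, 3]])  # positions 1 and 3
print(age_by_name.iloc[:3])      # positions 0, 1, and 2

# Boolean access
print(age_by_name.loc[age_by_name > 30])
```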

### Adding data elements

In pandas, adding data elements is similar to working with dictionaries. We can directly use the syntax [index] = value to add a single element. To add multiple elements, we can convert them into a Series and merge the two with pd.concat (the append method listed earlier was removed in pandas 2.0).

In [22]:
print("Before adding the elements:", user_age, '\n', '-'*30)
user_age['Wade'] = 39
user_age['Michael'] = 50
print("After adding the elements:", user_age)

Before adding the elements: name
Tom      18
Bob      30
Mary     25
James    40
Name: user_age_info, dtype: int64 
 ------------------------------
After adding the elements: name
Tom        18
Bob        30
Mary       25
James      40
Wade       39
Michael    50
Name: user_age_info, dtype: int64
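
Adding several elements at once can be sketched with `pd.concat`. The names `ages`, `new_users`, and `all_users` are illustrative; both Series are given the same name so that the result keeps it.

```python
import pandas as pd

ages = pd.Series([18, 30], index=["Tom", "Bob"], name="user_age_info")
new_users = pd.Series({"Wade": 39, "Michael": 50}, name="user_age_info")

# pd.concat returns a new Series; the inputs are left unchanged.
all_users = pd.concat([ages, new_users])
print(all_users)
```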


### Modifying data elements

To modify data elements in a Series, assign directly with the syntax Series[index/position] = new_value.

In [23]:
user_age['Wade'] = 38  # Modify by index

In [24]:
user_age

name
Tom        18
Bob        30
Mary       25
James      40
Wade       38
Michael    50
Name: user_age_info, dtype: int64

In [25]:
user_age[1] 

30

In [26]:
user_age[1] = 35  # Modify by positional index
user_age

name
Tom        18
Bob        35
Mary       25
James      40
Wade       38
Michael    50
Name: user_age_info, dtype: int64
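
Modification can also be written with the explicit accessors, which state whether a key is a label or a position, and with a Boolean mask to change every matching element. This is a sketch on a made-up Series named `ages`.

```python
import pandas as pd

ages = pd.Series([18, 35, 25, 40], index=["Tom", "Bob", "Mary", "James"])

ages.loc["Bob"] = 36      # modify by label
ages.iloc[0] = 19         # modify by integer position
ages.loc[ages > 38] = 38  # modify every element that matches a condition
print(ages)
```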

### Deleting data elements

To delete a data element from a Series, we can use the statement del Series[index]. Alternatively, we can use the drop method, which returns a new Series unless inplace=True is passed.

In [27]:
del user_age['Wade']

In [28]:
user_age

name
Tom        18
Bob        35
Mary       25
James      40
Michael    50
Name: user_age_info, dtype: int64

In [29]:
user_age.drop('Michael')   #  Will not change the original data.

name
Tom      18
Bob      35
Mary     25
James    40
Name: user_age_info, dtype: int64

In [30]:
user_age

name
Tom        18
Bob        35
Mary       25
James      40
Michael    50
Name: user_age_info, dtype: int64

In [31]:
user_age.drop('Michael', inplace = True)   # Change the original data.

In [32]:
user_age

name
Tom      18
Bob      35
Mary     25
James    40
Name: user_age_info, dtype: int64

In [33]:
user_age.drop(['Tom', 'Bob'], inplace = True)   # Drop multiple elements at once
user_age

name
Mary     25
James    40
Name: user_age_info, dtype: int64