In [1]:
import numpy as np
import pandas as pd

In [2]:
print(pd.__version__)

2.0.2


# Creating a DataFrame

A DataFrame is the most commonly used Pandas object. Similar to Series, a DataFrame supports various types of input data:

* Method 1: Dictionary of Series
* Method 2: Dictionary of Lists
* Method 3: List of Dictionaries
* Method 4: List of Lists (2D array)
* Method 5: DataFrame.from_dict
* Method 6: DataFrame.from_records
* Method 7: Reading from files

These methods provide flexibility in creating a DataFrame from different types of data sources.

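## Method 1: Dictionary of Series

```python
# General pattern: `values` and `index` are placeholders for your own data.
data = {
    "col1": pd.Series(values, index=index),
    "col2": pd.Series(values, index=index),
    "col3": pd.Series(values, index=index),
}
```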
 
* The keys of the dictionary represent column names.
* The values of the dictionary can be Series objects.
* If neither `index` nor `columns` is passed to the DataFrame constructor, the dictionary keys become the columns, and the union of the Series indexes becomes the index of the DataFrame.
* If `index` or `columns` is passed, the DataFrame keeps only the dictionary keys listed in `columns` and only the Series entries whose labels appear in `index`. Requested labels that are missing from the data are filled with NaN.

In [3]:
data = {
    "name": pd.Series(["Tom", "Bob", "Mary", "James", "Aria"]),
    "age": pd.Series([18, 30, 25, 40, 22]),
    "city": pd.Series(["Los Angeles", "New York", "Chicago", "Miami", "Brambleton"])
}
user_info = pd.DataFrame(data)
user_info

Unnamed: 0,name,age,city
0,Tom,18,Los Angeles
1,Bob,30,New York
2,Mary,25,Chicago
3,James,40,Miami
4,Aria,22,Brambleton


**Set the name as index**

In [4]:
index = ["Tom", "Bob", "Mary", "James", "Aria"]
data = {
    "age": pd.Series([18, 30, 25, 40, 22], index = index),
    "city": pd.Series(["Los Angeles", "New York", "Chicago", "Miami", "Brambleton"], index = index)
}
user_info = pd.DataFrame(data)             
user_info                      

Unnamed: 0,age,city
Tom,18,Los Angeles
Bob,30,New York
Mary,25,Chicago
James,40,Miami
Aria,22,Brambleton


**When both the Series and the DataFrame specify indexes, the following will occur:**

For each label in the index passed to the DataFrame, pandas checks whether that label exists in the Series.

If the label exists in the Series, the corresponding value is placed in the DataFrame at that label.

If the label does not exist in the Series, the DataFrame entry for that label is filled with NaN (missing value).

In summary, when both the Series and the DataFrame have specified indexes, the DataFrame will align the values from the Series based on the matching indexes and populate the DataFrame accordingly. Any indexes that do not have a corresponding value in the Series will result in NaN values in the DataFrame.
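As a minimal sketch of this alignment, the snippet below mixes labels that exist in both Series, in only one, and in neither. The label "Zoe" and the shortened Series are hypothetical and chosen only for illustration.

```python
import pandas as pd

ages = pd.Series([18, 30, 25], index=["Tom", "Bob", "Mary"])
cities = pd.Series(["Los Angeles", "New York"], index=["Tom", "Bob"])

# "Zoe" is a hypothetical label that appears in neither Series.
aligned = pd.DataFrame({"age": ages, "city": cities}, index=["Tom", "Mary", "Zoe"])
print(aligned)
```

"Tom" gets both values, "Mary" gets an age but a NaN city, and "Zoe" is NaN in every column. "Bob" is dropped because it is not in the requested index.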

In [5]:
index = ["Tom", "Bob", "Mary", "James", "Aria"]
data = {
    "age": pd.Series([18, 30, 25, 40, 22], index = index),
    "city": pd.Series(["Los Angeles", "New York", "Chicago", "Miami", "Brambleton"], index = index)
}
user_info = pd.DataFrame(data, index = ["Tom", "James", "Aria"])
user_info

Unnamed: 0,age,city
Tom,18,Los Angeles
James,40,Miami
Aria,22,Brambleton


In [6]:
data = {
    "age": pd.Series([18, 30, 25, 40, 22]),
    "city": pd.Series(["Los Angeles", "New York", "Chicago", "Miami", "Seattle"])
}
user_info = pd.DataFrame(data, index = ["Tom", "Bob", "Mary", "James", "Aria"])
user_info

Unnamed: 0,age,city
Tom,,
Bob,,
Mary,,
James,,
Aria,,


In [7]:
data['age'].index  # The "age" Series has the default RangeIndex (0 to 4), which shares no labels with the names

RangeIndex(start=0, stop=5, step=1)

In [8]:
user_info.index  # The DataFrame uses the requested names as its index; no Series label matches, so every value is NaN

Index(['Tom', 'Bob', 'Mary', 'James', 'Aria'], dtype='object')

**Set the index after creation:**

In [9]:
data = {
    "age": pd.Series([18, 30, 25, 40, 22]),
    "city": pd.Series(["Los Angeles", "New York", "Chicago", "Miami", "Seattle"])
}
user_info = pd.DataFrame(data)
# Assign the names as row labels through the index attribute
user_info.index = ['Tom', 'Bob', 'Mary', 'James', 'Aria']
user_info

Unnamed: 0,age,city
Tom,18,Los Angeles
Bob,30,New York
Mary,25,Chicago
James,40,Miami
Aria,22,Seattle


Columns are handled much like the index. When `columns` is passed to the DataFrame constructor:

* The specified columns become the definitive set of columns, in the order given.
* Dictionary keys that match a specified column supply that column's data.
* Dictionary keys that are not listed in `columns` are ignored.
* Specified columns with no matching key are still created and filled with NaN.

In essence, `columns` acts as both a filter and a template. In the example below, `age` is dropped because it is not listed, and `hobby` is all NaN because the data has no such key.

In [10]:
data = {
    "age": pd.Series([18, 30, 25, 40, 22]),
    "city": pd.Series(["Los Angeles", "New York", "Chicago", "Miami", "Brambleton"])
}
user_info = pd.DataFrame(data, columns = ['city', 'hobby'])
user_info

Unnamed: 0,city,hobby
0,Los Angeles,
1,New York,
2,Chicago,
3,Miami,
4,Brambleton,


## Method 2: Dictionary of Lists

In [11]:
data  = {
    "name": ["Tom", "Bob", "Mary", "James", "Aria"],
    "age" : [18, 30, 25, 40, 22],
    "city" : ["Los Angeles", "New York", "Chicago", "Miami", "Brambleton"]
}
user_info = pd.DataFrame(data = data)
user_info

Unnamed: 0,name,age,city
0,Tom,18,Los Angeles
1,Bob,30,New York
2,Mary,25,Chicago
3,James,40,Miami
4,Aria,22,Brambleton


## Method 3: List of Dictionaries

In [12]:
data = [
    {'name':'Tom','age':18,'city': "Los Angeles"},
    {'name':'Bob','age': 30, 'city': 'New York'},
    {'name':'Mary','age': 25, 'city': 'Chicago'},
    {'name':'James','age': 40, 'city': 'Miami'}
]
user_info = pd.DataFrame(data)
user_info

Unnamed: 0,name,age,city
0,Tom,18,Los Angeles
1,Bob,30,New York
2,Mary,25,Chicago
3,James,40,Miami


In [13]:
user_info.set_index('name')

Unnamed: 0_level_0,age,city
name,Unnamed: 1_level_1,Unnamed: 2_level_1
Tom,18,Los Angeles
Bob,30,New York
Mary,25,Chicago
James,40,Miami


## Method 4: List of Lists (2D array)

In [14]:
data = [['Tom',18, "Los Angeles"],
        ['Bob', 30, "New York"],
        ['Mary', 25, "Chicago"],
        ['James', 40, "Miami"]]
columns = ['name', 'age', 'city']
user_info = pd.DataFrame(data, columns=columns)
user_info

Unnamed: 0,name,age,city
0,Tom,18,Los Angeles
1,Bob,30,New York
2,Mary,25,Chicago
3,James,40,Miami


## Method 5: DataFrame.from_dict

The ``from_dict()`` method in DataFrame is used to convert a dictionary into a DataFrame object. This method accepts the following parameters:

* data: A dictionary or an array-like object to create the DataFrame.
* orient: The orientation of the data. It accepts "columns" (the default; dictionary keys become column labels), "index" (dictionary keys become row labels), or "tight" (a dictionary with the keys "index", "columns", "data", "index_names", and "column_names").
* columns: The column labels to use when the orientation is "index". Passing it together with orient="columns" or orient="tight" raises a ValueError.

In summary, `from_dict()` allows you to create a DataFrame from a dictionary, specifying the orientation of the data and the labels for columns (if applicable).
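The interaction between `orient` and `columns` can be sketched as follows. The `scores` dictionary is hypothetical data with one list per row label; the snippet only shows that `columns` is accepted with `orient="index"` and rejected with the default orientation.

```python
import pandas as pd

# Hypothetical data: each key is a row label, each list holds that row's values.
scores = {"Tom": [98, 18], "Bob": [95, 30]}

# orient="index" turns each key into a row, so `columns` names the positions.
by_row = pd.DataFrame.from_dict(scores, orient="index", columns=["score", "age"])
print(by_row)

# With the default orient="columns", passing `columns` raises a ValueError.
try:
    pd.DataFrame.from_dict(scores, columns=["score", "age"])
    raised = False
except ValueError:
    raised = True
print(raised)
```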

In [15]:
data  = {
    "name": ["Tom", "Bob", "Mary", "James", "Aria"],
    "age" : [18, 30, 25, 40, 22],
    "city" : ["Los Angeles", "New York", "Chicago", "Miami", "Brambleton"]
}
user_info = pd.DataFrame.from_dict(data)
user_info

Unnamed: 0,name,age,city
0,Tom,18,Los Angeles
1,Bob,30,New York
2,Mary,25,Chicago
3,James,40,Miami
4,Aria,22,Brambleton


In [16]:
user_info = pd.DataFrame.from_dict(data, orient = 'index')
user_info

Unnamed: 0,0,1,2,3,4
name,Tom,Bob,Mary,James,Aria
age,18,30,25,40,22
city,Los Angeles,New York,Chicago,Miami,Brambleton


In [17]:
user_info = pd.DataFrame.from_dict(data, orient='index', 
            columns = ['Student 1', 'Student 2', 'Student 3', 'Student 4', 'Student 5'])
user_info

Unnamed: 0,Student 1,Student 2,Student 3,Student 4,Student 5
name,Tom,Bob,Mary,James,Aria
age,18,30,25,40,22
city,Los Angeles,New York,Chicago,Miami,Brambleton


## Method 6: DataFrame.from_records

The ```from_records()``` method in DataFrame is used to convert structured data or arrays into a DataFrame object. This method accepts the following parameters:

* data: A structured NumPy array, a sequence of tuples or dictionaries, or a DataFrame.
* index: The field to use as the index, or an explicit sequence of index labels.
* exclude: A sequence of columns or fields to exclude.
* columns: A sequence of column names.
* coerce_float: A boolean value (default False). If True, pandas attempts to convert non-string, non-numeric objects (such as decimal.Decimal) to floating point.
* nrows: An integer value (default None) representing the number of rows to read if the data is an iterator.

In summary, ```from_records()``` allows you to create a DataFrame from structured data or arrays, providing options for specifying index, columns, excluding columns, and controlling data types.
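A minimal sketch of the `columns`, `index`, and `nrows` parameters, assuming the records arrive as a list of tuples. The `records` list is hypothetical sample data in the same spirit as the examples in this notebook.

```python
import pandas as pd

# Hypothetical records: one tuple per row.
records = [
    ("Tom", 18, "Los Angeles"),
    ("Bob", 30, "New York"),
    ("Mary", 25, "Chicago"),
]

# Name the fields and use the "name" field as the index.
users = pd.DataFrame.from_records(records, columns=["name", "age", "city"], index="name")
print(users)

# When the data is an iterator, nrows limits how many rows are consumed.
first_two = pd.DataFrame.from_records(iter(records), columns=["name", "age", "city"], nrows=2)
print(first_two)
```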

In [18]:
data  = {
    "name": ["Tom", "Bob", "Mary", "James", "Aria"],
    "age" : [18, 30, 25, 40, 22],
    "city" : ["Los Angeles", "New York", "Chicago", "Miami", "Brambleton"]
}
user_info = pd.DataFrame.from_records(data, exclude = ['age'])
user_info

Unnamed: 0,city,name
0,Los Angeles,Tom
1,New York,Bob
2,Chicago,Mary
3,Miami,James
4,Brambleton,Aria


## Method 7: Reading from files

In data analysis, reading and writing operations are common tasks. Pandas provides several API methods for I/O operations. Below are a few commonly used methods for reading data. Each returns a DataFrame, except `read_html`, which returns a list of DataFrames (one per table found):

* read_csv: Reads data from a CSV file.
* read_excel: Reads data from an Excel file.
* read_html: Reads the tables in an HTML document and returns them as a list of DataFrames.
* read_sql: Reads data from a SQL database.
* read_json: Reads data from a JSON file.

These methods allow you to read data from different file formats and databases into a DataFrame, enabling further analysis and manipulation using Pandas' powerful functionalities.
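As a small illustration that does not depend on any file on disk, the sketch below feeds in-memory text to `read_csv` and `read_json` through `io.StringIO`. The sample text is hypothetical; with real data you would pass a file path instead.

```python
import io

import pandas as pd

# Hypothetical CSV content; a file path would normally go here.
csv_text = "name,age,city\nTom,18,Los Angeles\nBob,30,New York\n"
from_csv = pd.read_csv(io.StringIO(csv_text), index_col="name")
print(from_csv)

# Hypothetical JSON content: a list of records.
json_text = '[{"name": "Tom", "age": 18}, {"name": "Bob", "age": 30}]'
from_json = pd.read_json(io.StringIO(json_text))
print(from_json)
```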

# DataFrame Properties

The basic properties of a DataFrame are as follows:

* ```df.shape```: View the shape (number of rows and columns) of the DataFrame.

* ```df.columns```: View the column names of the DataFrame.

* ```df.index```: View the index labels of the DataFrame.

* ```df.dtypes```: View the data types of each column in the DataFrame.

* ```df.ndim```: View the number of dimensions (2 for DataFrame) of the DataFrame.

* ```df.T```: Transpose the DataFrame (swap rows and columns).

* ```df.values```: Return the data as a NumPy array; the pandas documentation recommends ```df.to_numpy()``` for this purpose.

These properties allow you to examine and retrieve important information about the structure, dimensions, and content of a DataFrame, facilitating data exploration and analysis tasks.
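The properties listed above can be tried on a small frame. The three-row `df` below is hypothetical sample data.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Tom", "Bob", "Mary"], "age": [18, 30, 25]})

print(df.shape)          # (3, 2)
print(df.ndim)           # 2
print(list(df.columns))  # ['name', 'age']
print(df.index)          # RangeIndex(start=0, stop=3, step=1)
print(df.dtypes)         # one dtype per column
print(df.T.shape)        # (2, 3)
print(df.values.shape)   # (3, 2)
```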

# DataFrame Methods

## Basic Methods

* ```df.head(n)```: View the first n rows of data. By default, n is 5.

* ```df.tail(n)```: View the last n rows of data. By default, n is 5.

* ```df.info()```: View the overall information about the DataFrame.

* ```df.select_dtypes(include=['float64'])```: Select columns of specific data types.

In [19]:
user_info.head(3)

Unnamed: 0,city,name
0,Los Angeles,Tom
1,New York,Bob
2,Chicago,Mary


In [20]:
user_info.tail(2)

Unnamed: 0,city,name
3,Miami,James
4,Brambleton,Aria


In [21]:
user_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   city    5 non-null      object
 1   name    5 non-null      object
dtypes: object(2)
memory usage: 208.0+ bytes


In [22]:
user_info.select_dtypes(include = 'object')

Unnamed: 0,city,name
0,Los Angeles,Tom
1,New York,Bob
2,Chicago,Mary
3,Miami,James
4,Brambleton,Aria


## Description and Statistics

**Description and statistics of the population**

* ```df.describe()```: View an overall summary of the numeric columns in the DataFrame.

* ```df.describe(include=['object'])```: View an overall summary of the non-numeric columns in the DataFrame.

* ```df.groupby(col1)[col2].count()```: Perform a grouping operation on col1 and count the occurrences of col2 within each group.

**Single Column Description and Statistics**: To analyze a single column, we can use ```df[col]``` to select the column data, which will be of type Series. You can then apply the methods introduced in the previous section for Series.

* ```df[col].value_counts()```: Count the occurrences of each value in a specific column, similar to grouping.
* ```df[col].sum()```: Compute the sum of the values in a specific column.

In [23]:
data  = {
    "age" : [18, 30, 25, 40, 22],
    "city" : ["Los Angeles", "New York", "Chicago", "Miami", "Brambleton"],
    'sex': ['Male', 'Male', 'Female', 'Male', 'Female']
}
index = pd.Index( ["Tom", "Bob", "Mary", "James", "Aria"], name = 'name')
user_info = pd.DataFrame(data = data, index = index)
user_info

Unnamed: 0_level_0,age,city,sex
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Tom,18,Los Angeles,Male
Bob,30,New York,Male
Mary,25,Chicago,Female
James,40,Miami,Male
Aria,22,Brambleton,Female


In [24]:
user_info.describe()

Unnamed: 0,age
count,5.0
mean,27.0
std,8.485281
min,18.0
25%,22.0
50%,25.0
75%,30.0
max,40.0


In [25]:
user_info.describe(include = ['object'])

Unnamed: 0,city,sex
count,5,5
unique,5,2
top,Los Angeles,Male
freq,1,3


In [26]:
user_info.groupby('sex')['age'].count()

sex
Female    2
Male      3
Name: age, dtype: int64

In [27]:
user_info['sex'].value_counts()

sex
Male      3
Female    2
Name: count, dtype: int64

In [28]:
user_info['age'].sum()

135

## Sorting
Sorting data is a common operation in data analysis. Pandas supports two types of sorting: sorting by axis (index or columns) and sorting by actual values.

* ```sort_index()``` method is used to sort by labels (row labels or column labels).
* ```sort_values()``` method is used to sort by the values in one or more columns.

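Both methods are available on Series and DataFrame objects and behave very similarly. The main differences are that `DataFrame.sort_values()` requires a `by` argument naming the column or columns to sort on, while `Series.sort_values()` does not, and that `DataFrame.sort_index()` can sort either the rows (`axis=0`) or the columns (`axis=1`).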
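A short sketch of descending and multi-key sorting, using a hypothetical `people` frame so it does not touch the notebook's `user_info`:

```python
import pandas as pd

people = pd.DataFrame(
    {
        "age": [18, 30, 25, 40, 22],
        "sex": ["Male", "Male", "Female", "Male", "Female"],
    },
    index=["Tom", "Bob", "Mary", "James", "Aria"],
)

# Sort by one column, largest first.
by_age_desc = people.sort_values(by="age", ascending=False)

# Sort by sex (A to Z), then by age (largest first) within each group.
by_sex_then_age = people.sort_values(by=["sex", "age"], ascending=[True, False])

# Sort by row label, in reverse alphabetical order.
by_label_desc = people.sort_index(ascending=False)

# On a Series, sort_values() needs no `by` argument.
sorted_ages = people["age"].sort_values()

print(by_sex_then_age)
```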

In [29]:
user_info.sort_index()

Unnamed: 0_level_0,age,city,sex
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Aria,22,Brambleton,Female
Bob,30,New York,Male
James,40,Miami,Male
Mary,25,Chicago,Female
Tom,18,Los Angeles,Male


In [30]:
user_info.sort_index(axis = 1, ascending = True)

Unnamed: 0_level_0,age,city,sex
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Tom,18,Los Angeles,Male
Bob,30,New York,Male
Mary,25,Chicago,Female
James,40,Miami,Male
Aria,22,Brambleton,Female


In [31]:
user_info.sort_values(by = 'age')

Unnamed: 0_level_0,age,city,sex
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Tom,18,Los Angeles,Male
Aria,22,Brambleton,Female
Mary,25,Chicago,Female
Bob,30,New York,Male
James,40,Miami,Male


## Conditional element-wise function

### where
```python
where(cond, other=np.nan, *, inplace=False, axis=None, level=None)
```

In [32]:
user_info['score'] = [98, 95, 87, 83, 100]
user_info

Unnamed: 0_level_0,age,city,sex,score
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Tom,18,Los Angeles,Male,98
Bob,30,New York,Male,95
Mary,25,Chicago,Female,87
James,40,Miami,Male,83
Aria,22,Brambleton,Female,100


By default, when using the `where()` function in pandas, if the condition is not met, the corresponding values are replaced with `NaN` (Not a Number).

In [33]:
user_info.where(user_info['age'] > 25)

Unnamed: 0_level_0,age,city,sex,score
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Tom,,,,
Bob,30.0,New York,Male,95.0
Mary,,,,
James,40.0,Miami,Male,83.0
Aria,,,,


In [34]:
user_info.where(user_info['age'] > 25).dropna()

Unnamed: 0_level_0,age,city,sex,score
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Bob,30.0,New York,Male,95.0
James,40.0,Miami,Male,83.0


In [35]:
user_info[user_info['age'] > 25]

Unnamed: 0_level_0,age,city,sex,score
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Bob,30,New York,Male,95
James,40,Miami,Male,83


In [36]:
user_info.where(user_info['age'] > 25, 25) # rows where age is not above 25 have every column replaced with 25

Unnamed: 0_level_0,age,city,sex,score
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Tom,25,25,25,25
Bob,30,New York,Male,95
Mary,25,25,25,25
James,40,Miami,Male,83
Aria,25,25,25,25


In [37]:
user_info

Unnamed: 0_level_0,age,city,sex,score
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Tom,18,Los Angeles,Male,98
Bob,30,New York,Male,95
Mary,25,Chicago,Female,87
James,40,Miami,Male,83
Aria,22,Brambleton,Female,100
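In [36], the replacement value is written into every column of the rows that fail the condition. If the goal is to change only the `age` column, one option is to call `where()` on that column alone. The sketch below uses a hypothetical `people` frame and assigns the result to a copy.

```python
import pandas as pd

people = pd.DataFrame(
    {"age": [18, 30, 25, 40, 22], "score": [98, 95, 87, 83, 100]},
    index=["Tom", "Bob", "Mary", "James", "Aria"],
)

floored = people.copy()
# Keep ages above 25 and replace the others with 25; other columns are untouched.
floored["age"] = people["age"].where(people["age"] > 25, 25)
print(floored)
```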


### query()
```python
query(expr, *, inplace=False, **kwargs)
```

In [38]:
user_info.query('(age > 20) & (sex == "Male")')

Unnamed: 0_level_0,age,city,sex,score
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Bob,30,New York,Male,95
James,40,Miami,Male,83


## Dealing with duplicate elements
### duplicated() method

```python
duplicated(subset=None, keep='first')
```
**subset**: The column or columns used to identify duplicates. By default, all columns in the DataFrame are considered. For a single column, you can provide a string indicating the column name. For multiple columns, you can provide a *list* of column names.

**keep**: The keep parameter determines which occurrences are marked as duplicates (True) in the resulting boolean Series. The available options are:

* keep = 'first' (default): Marks every duplicate as True except the first occurrence, which is marked False.
* keep = 'last': Marks every duplicate as True except the last occurrence, which is marked False.
* keep = False: Marks all occurrences of duplicated rows as True.


In [39]:
user_info.duplicated(subset = 'sex')

name
Tom      False
Bob       True
Mary     False
James     True
Aria      True
dtype: bool
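To compare the three `keep` options side by side, here is a minimal sketch on a hypothetical one-column frame with the same `sex` values as above:

```python
import pandas as pd

people = pd.DataFrame(
    {"sex": ["Male", "Male", "Female", "Male", "Female"]},
    index=["Tom", "Bob", "Mary", "James", "Aria"],
)

first = people.duplicated(subset="sex", keep="first")  # first occurrence is False
last = people.duplicated(subset="sex", keep="last")    # last occurrence is False
every = people.duplicated(subset="sex", keep=False)    # every repeated value is True

print(pd.DataFrame({"first": first, "last": last, "False": every}))
```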

### drop_duplicates() method

```python
drop_duplicates(subset=None, *, keep='first', inplace=False, ignore_index=False)
```

**subset**: Optional parameter that specifies the subset of columns to consider when identifying duplicates. By default, it is set to None, which means all columns are considered.

**keep**: Specifies which of the duplicate rows to keep. It can take the following values:

* first (default): Keeps the first occurrence of each duplicate row and removes the subsequent duplicates.

* last: Keeps the last occurrence of each duplicate row and removes the preceding duplicates.

* False: Removes all occurrences of duplicate rows.

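**inplace**: Specifies whether to perform the operation in place or return a new DataFrame. By default, it is set to False, which returns a new DataFrame; with True, the original is modified and None is returned.

**ignore_index**: If True, the remaining rows are relabeled 0, 1, ..., n - 1. By default, it is set to False, which keeps the original index labels.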

In [40]:
user_info.drop_duplicates(subset = ['sex'])

Unnamed: 0_level_0,age,city,sex,score
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Tom,18,Los Angeles,Male,98
Mary,25,Chicago,Female,87


## Custom functions

We can write our own functions and apply them to a DataFrame or Series. The commonly used methods are `map`, `apply`, and `applymap`.

`map` is a Series method in pandas 2.0. It applies a transformation to each element in the Series.

`apply` is available for both Series and DataFrame. When applied to a Series, it operates on each value individually. When applied to a DataFrame, it operates on each column (`axis=0`, the default) or each row (`axis=1`).

`applymap` is a DataFrame method. It applies a function to each element in the DataFrame, much as `apply` does on a Series. From pandas 2.1 onward, `applymap` is deprecated in favor of `DataFrame.map`.
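The element-wise case can be sketched as below. The `scores` frame and the grading rule are hypothetical. The `hasattr` check is an assumption added so the snippet also runs on newer pandas releases, where the same operation is named `DataFrame.map`; on pandas 2.0 it falls back to `applymap`.

```python
import pandas as pd

scores = pd.DataFrame({"math": [98, 95], "art": [87, 83]}, index=["Tom", "Bob"])

# Pick whichever element-wise method this pandas version provides.
elementwise = scores.map if hasattr(scores, "map") else scores.applymap

# Apply the function to every cell of the DataFrame.
graded = elementwise(lambda x: "A" if x >= 90 else "B")
print(graded)
```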


In [41]:
user_info

Unnamed: 0_level_0,age,city,sex,score
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Tom,18,Los Angeles,Male,98
Bob,30,New York,Male,95
Mary,25,Chicago,Female,87
James,40,Miami,Male,83
Aria,22,Brambleton,Female,100


In [42]:
# Flag whether each user's age is at least 30
user_info['age'].map(lambda x: "yes" if x >= 30 else "no")

name
Tom       no
Bob      yes
Mary      no
James    yes
Aria      no
Name: age, dtype: object

In [43]:
# Get the maximum value of each column
user_info.apply(lambda x: x.max(), axis = 0)

age            40
city     New York
sex          Male
score         100
dtype: object

In [44]:
# Add 1 to the age
user_info.apply(lambda x: x.age + 1, axis = 1)

name
Tom      19
Bob      31
Mary     26
James    41
Aria     23
dtype: int64

# Adding, deleting, modifying, and querying a DataFrame



## Index/Lookup

From the perspective of lookup methods, there are three types:

* Lookup by label (row index label or column name)
* Lookup by position (integer location)
* Lookup by boolean condition

Note: When combining multiple boolean conditions, wrap each condition in parentheses and join them with "&" (and), "|" (or), and "~" (not).

Syntax-wise, there are six types:

* `df.col`: Select column by column name (not recommended). It will raise an error if the column does not exist.

* `df.get(col, default)`: Retrieve specified column data by column name. If it does not exist, it does not raise an error and can be given a default value.

* `df[]`:

    - [col]: Retrieve a single column by column name.
    - [[col1, col2, ...]]: Retrieve multiple columns by column names.
    - [start: end]: Retrieve rows by position using slicing.
    - [boolean expression]: Retrieve rows based on boolean conditions.

* `df.loc[row label, column label]` (recommended):

    - Each selector can be a single row label or column name, selecting one row or column.
    - Each selector can be a list of row labels or column names, selecting multiple rows or columns.
    - Each selector can be a label slice, which includes both endpoints.
    - Each selector can be a boolean expression, selecting the rows or columns where it is True.

* `df.iloc[row position, column position]`:

    - Each selector can be a single integer position, selecting one row or column.
    - Each selector can be a list of integer positions, selecting multiple rows or columns.
    - Each selector can be a position slice, selecting a range of consecutive rows or columns and excluding the end position.

* `df.select_dtypes(include=None, exclude=None)`: Select specified columns by data types.

These methods provide various ways to perform index-based and label-based lookup in a DataFrame, enabling you to retrieve single or multiple rows/columns based on labels, positions, or boolean conditions, as well as filtering columns based on data types.
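The operators "|" and "~", and label slices with `loc`, can be sketched as follows. The `people` frame is a hypothetical copy of the sample data, so the notebook's `user_info` is left alone.

```python
import pandas as pd

people = pd.DataFrame(
    {
        "age": [18, 30, 25, 40, 22],
        "city": ["Los Angeles", "New York", "Chicago", "Miami", "Brambleton"],
        "sex": ["Male", "Male", "Female", "Male", "Female"],
    },
    index=["Tom", "Bob", "Mary", "James", "Aria"],
)

# "|" (or): each condition needs its own parentheses.
young_or_female = people[(people["age"] < 20) | (people["sex"] == "Female")]

# "~" (not): invert a boolean condition.
not_male = people[~(people["sex"] == "Male")]

# loc with a boolean row selector and a list of column names.
older_cities = people.loc[people["age"] >= 30, ["city"]]

# loc with label slices: both endpoints are included.
label_slice = people.loc["Bob":"James", "age":"city"]

print(label_slice)
```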

In [45]:
data  = {
    "age" : [18, 30, 25, 40, 22],
    "city" : ["Los Angeles", "New York", "Chicago", "Miami", "Brambleton"],
    'sex': ['Male', 'Male', 'Female', 'Male', 'Female']
}
index = pd.Index( ["Tom", "Bob", "Mary", "James", "Aria"], name = 'name')
user_info = pd.DataFrame(data = data, index = index)
user_info

Unnamed: 0_level_0,age,city,sex
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Tom,18,Los Angeles,Male
Bob,30,New York,Male
Mary,25,Chicago,Female
James,40,Miami,Male
Aria,22,Brambleton,Female


######  Retrieve a single column by column name

In [46]:
user_info['age'] 

name
Tom      18
Bob      30
Mary     25
James    40
Aria     22
Name: age, dtype: int64

In [47]:
user_info.age

name
Tom      18
Bob      30
Mary     25
James    40
Aria     22
Name: age, dtype: int64

###### Retrieve a column with the get() method

If the column exists, `get()` returns the column data.

If the column does not exist, `get()` returns the specified default value (or None if no default is given) instead of raising an error.


In [48]:
user_info.get('age') 

name
Tom      18
Bob      30
Mary     25
James    40
Aria     22
Name: age, dtype: int64

In [49]:
user_info.get('age2',22) 

22

In [50]:
user_info.loc[:, 'age']

name
Tom      18
Bob      30
Mary     25
James    40
Aria     22
Name: age, dtype: int64

In [51]:
user_info[['age', 'city']]

Unnamed: 0_level_0,age,city
name,Unnamed: 1_level_1,Unnamed: 2_level_1
Tom,18,Los Angeles
Bob,30,New York
Mary,25,Chicago
James,40,Miami
Aria,22,Brambleton


###### To retrieve columns by position

In [52]:
user_info.iloc[:, 1]

name
Tom      Los Angeles
Bob         New York
Mary         Chicago
James          Miami
Aria      Brambleton
Name: city, dtype: object

In [53]:
user_info.iloc[:, [0, 1]]

Unnamed: 0_level_0,age,city
name,Unnamed: 1_level_1,Unnamed: 2_level_1
Tom,18,Los Angeles
Bob,30,New York
Mary,25,Chicago
James,40,Miami
Aria,22,Brambleton


###### To retrieve columns by data type

In [54]:
user_info.select_dtypes(include = 'int64')

Unnamed: 0_level_0,age
name,Unnamed: 1_level_1
Tom,18
Bob,30
Mary,25
James,40
Aria,22


###### To retrieve specific rows of data from a DataFrame

In [55]:
user_info.loc['Tom']  # single row

age              18
city    Los Angeles
sex            Male
Name: Tom, dtype: object

In [56]:
user_info.loc[['Tom', 'Bob']]  # Multiple rows

Unnamed: 0_level_0,age,city,sex
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Tom,18,Los Angeles,Male
Bob,30,New York,Male


###### To retrieve rows of data by position

In [57]:
user_info.iloc[0]  # The first row

age              18
city    Los Angeles
sex            Male
Name: Tom, dtype: object

In [58]:
user_info.iloc[[0,1]]  # First two rows

Unnamed: 0_level_0,age,city,sex
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Tom,18,Los Angeles,Male
Bob,30,New York,Male


In [59]:
user_info.iloc[0:2]  #  Slicing first two rows

Unnamed: 0_level_0,age,city,sex
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Tom,18,Los Angeles,Male
Bob,30,New York,Male


###### To retrieve the data by boolean index

In [60]:
user_info[user_info.age >= 30]

Unnamed: 0_level_0,age,city,sex
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Bob,30,New York,Male
James,40,Miami,Male


In [61]:
user_info[(user_info.age >= 30) & (user_info.city == 'New York')]

Unnamed: 0_level_0,age,city,sex
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Bob,30,New York,Male


###### isin() method 

In [62]:
user_info[user_info.city.isin(['New York','Seattle'])]

Unnamed: 0_level_0,age,city,sex
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Bob,30,New York,Male


###### To retrieve data based on labels

In [63]:
user_info.loc['Tom', 'age']

18

In [64]:
user_info.loc[['Tom', 'James'], ['age', 'city']]

Unnamed: 0_level_0,age,city
name,Unnamed: 1_level_1,Unnamed: 2_level_1
Tom,18,Los Angeles
James,40,Miami


###### To retrieve data based on position

In [65]:
user_info.iloc[0, 0]  # first row first column element

18

In [66]:
user_info.iloc[[0, 3], [0, 1]] # first row, fourth row, first column and second column 

Unnamed: 0_level_0,age,city
name,Unnamed: 1_level_1,Unnamed: 2_level_1
Tom,18,Los Angeles
James,40,Miami


## Adding elements
`df[new_col] = data`: Adds a new column new_col to the DataFrame df with values from data, which can be a list-like object.

`df.append(other)`: Appended the rows of other to the end of the DataFrame df in older pandas versions. It was deprecated in pandas 1.4 and removed in pandas 2.0, so with the version used here (2.0.2), use `pd.concat([df, other])` instead.

`df.insert(loc, column, value)`: Inserts a new column named column at position loc in the DataFrame df, with values from value. This modifies df in place.


`df.loc[index, col] = data` (recommended): Adds a new row or column to the DataFrame df using the loc accessor. This is the recommended way to add a row or a new column.
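Rows held in another DataFrame can be added with `pd.concat`, as in the minimal sketch below. The `people` and `new_row` frames are hypothetical sample data.

```python
import pandas as pd

people = pd.DataFrame(
    {"age": [18, 30], "city": ["Los Angeles", "New York"]},
    index=pd.Index(["Tom", "Bob"], name="name"),
)

# The row to add, built as a one-row DataFrame with the same columns.
new_row = pd.DataFrame(
    {"age": [28], "city": ["DC"]},
    index=pd.Index(["Linda"], name="name"),
)

# concat returns a new DataFrame; `people` itself is left unchanged.
combined = pd.concat([people, new_row])
print(combined)
```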

In [67]:
data = {
    "age" : [18, 30, 25, 40, 22],
    "city" : ["Los Angeles", "New York", "Chicago", "Miami", "Brambleton"],
}
index = pd.Index( ["Tom", "Bob", "Mary", "James", "Aria"], name = 'name')
user_info = pd.DataFrame(data, index = index)
user_info

Unnamed: 0_level_0,age,city
name,Unnamed: 1_level_1,Unnamed: 2_level_1
Tom,18,Los Angeles
Bob,30,New York
Mary,25,Chicago
James,40,Miami
Aria,22,Brambleton


###### Adding new column

In [68]:
user_info['sex'] =  ['Male', 'Male', 'Female', 'Male', 'Female' ]
user_info

Unnamed: 0_level_0,age,city,sex
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Tom,18,Los Angeles,Male
Bob,30,New York,Male
Mary,25,Chicago,Female
James,40,Miami,Male
Aria,22,Brambleton,Female


###### To insert a new column at a specified position in a DataFrame

In [69]:
user_info.insert(1, "hobby", ["Soccer", "Baseball", "Dance", "Reading", "Smile"])

In [70]:
user_info

Unnamed: 0_level_0,age,hobby,city,sex
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Tom,18,Soccer,Los Angeles,Male
Bob,30,Baseball,New York,Male
Mary,25,Dance,Chicago,Female
James,40,Reading,Miami,Male
Aria,22,Smile,Brambleton,Female


The `assign()` method in DataFrame allows you to create new columns based on existing columns. It returns a new DataFrame with the additional columns included.

In [71]:
user_info.assign(description = user_info['city'] + '- '+ user_info['sex'] + '-' + user_info['hobby'])

Unnamed: 0_level_0,age,hobby,city,sex,description
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Tom,18,Soccer,Los Angeles,Male,Los Angeles- Male-Soccer
Bob,30,Baseball,New York,Male,New York- Male-Baseball
Mary,25,Dance,Chicago,Female,Chicago- Female-Dance
James,40,Reading,Miami,Male,Miami- Male-Reading
Aria,22,Smile,Brambleton,Female,Brambleton- Female-Smile


###### Adding new row

In [72]:
user_info.loc['Linda'] = [28, 'Singing', 'DC', 'Female']
user_info

Unnamed: 0_level_0,age,hobby,city,sex
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Tom,18,Soccer,Los Angeles,Male
Bob,30,Baseball,New York,Male
Mary,25,Dance,Chicago,Female
James,40,Reading,Miami,Male
Aria,22,Smile,Brambleton,Female
Linda,28,Singing,DC,Female


In [73]:
# data = {
#     "age" : [18, 30, 25, 40, 22],
#     "city" : ["Los Angeles", "New York", "Chicago", "Miami", "Seattle"],
# }
# index = pd.Index( ["Tom", "Bob", "Mary", "James", "Sun"], name = 'name')
# user_info = pd.DataFrame(data, index = index)

# user1 = pd.Series(data=[23, 'Houston', 'Hiking', 'Female'], 
#                    index = ['age', 'city', 'hobby', 'sex'],name = 'Lisa')
# concatenated_results = pd.concat([user_info, pd.DataFrame(user1).T])

## Modifying a DataFrame

The relevant methods for modifying a DataFrame are as follows:

* `df.columns = columns`: Set new column names using a list.
* `df.index = index`: Set new index values using a list.
* `df.rename(index=None, columns=None, inplace=False)`: Modify the index using index parameter and modify the column names using columns parameter. The inplace parameter determines whether to modify the original DataFrame.
* `df.loc[index, col] = value`: Modify specific data based on row and column labels.
* `df.iloc[iindex, icol] = value`: Modify specific data based on row and column positions.
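The modification methods above can be sketched on a hypothetical `people` frame. The new names ("Thomas", "home_city", "years", "town") and values are made up for illustration.

```python
import pandas as pd

people = pd.DataFrame(
    {"age": [18, 30, 25], "city": ["Los Angeles", "New York", "Chicago"]},
    index=["Tom", "Bob", "Mary"],
)

# rename returns a new DataFrame by default; `people` keeps its labels.
renamed = people.rename(index={"Tom": "Thomas"}, columns={"city": "home_city"})

# Modify one cell by label, then one cell by position (row 2, column 1).
people.loc["Bob", "age"] = 31
people.iloc[2, 1] = "Boston"

# Replace all column names at once.
people.columns = ["years", "town"]

print(renamed)
print(people)
```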


## Deleting from a DataFrame

* `df.drop(index=None, columns=None, inplace=False)`: Delete rows by label using the index parameter, or delete columns by name using the columns parameter. The inplace parameter determines whether to modify the original DataFrame. With inplace=False (the default), the method returns a new DataFrame without the dropped rows or columns; with inplace=True, it returns None.

* `df.pop(col)`: Delete the specified column and return the deleted column.

* `del df[col]`: Delete the specified column without returning any value.
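A minimal sketch of the three deletion approaches, using a hypothetical `people` frame:

```python
import pandas as pd

people = pd.DataFrame(
    {
        "age": [18, 30, 25],
        "city": ["Los Angeles", "New York", "Chicago"],
        "sex": ["Male", "Male", "Female"],
    },
    index=["Tom", "Bob", "Mary"],
)

# drop returns a new DataFrame; `people` is unchanged by these two calls.
without_bob = people.drop(index="Bob")
without_sex = people.drop(columns=["sex"])

# pop removes the column from `people` and returns it as a Series.
cities = people.pop("city")

# del removes the column from `people` and returns nothing.
del people["sex"]

print(people)
```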

# Iterating over a DataFrame in pandas

In [74]:
for col in user_info:
    print(col)

age
hobby
city
sex


In [75]:
for col in user_info:
    print(col)
    print(user_info[col])

age
name
Tom      18
Bob      30
Mary     25
James    40
Aria     22
Linda    28
Name: age, dtype: int64
hobby
name
Tom        Soccer
Bob      Baseball
Mary        Dance
James     Reading
Aria        Smile
Linda     Singing
Name: hobby, dtype: object
city
name
Tom      Los Angeles
Bob         New York
Mary         Chicago
James          Miami
Aria      Brambleton
Linda             DC
Name: city, dtype: object
sex
name
Tom        Male
Bob        Male
Mary     Female
James      Male
Aria     Female
Linda    Female
Name: sex, dtype: object


In [76]:
# Not finished