In [1]:
import numpy as np
import pandas as pd

---

### Basic data structures in pandas

1. `Series`: __a one-dimensional labeled array__ holding data of any type
    <br>such as integers, strings, Python objects etc.

2. `DataFrame`: __a two-dimensional data structure__ that holds data like a two-dimensional array or __a table with rows and columns__.
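As a minimal sketch of the two structures described above, the snippet below builds one of each. The names `prices` and `table` and their values are made up for illustration only.

```python
import pandas as pd

# A Series: one-dimensional, each value paired with an index label.
# (prices and table are hypothetical names used only in this sketch.)
prices = pd.Series([1.5, 2.0, 3.25], index=["a", "b", "c"])

# A DataFrame: two-dimensional, with labeled rows and labeled columns.
table = pd.DataFrame(
    {"qty": [10, 20, 30], "price": [1.5, 2.0, 3.25]},
    index=["a", "b", "c"],
)

print(prices["b"])            # look up a value by its label
print(table.shape)            # (number of rows, number of columns)
print(type(table["price"]))   # a single column of a DataFrame is a Series
```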

---

### Object creation

In [2]:
# letting pandas create a default RangeIndex
s = pd.Series([1, 2, 3, np.nan, 5, 6])

`numpy.nan` is a special floating-point value from the `NumPy library` that represents a `null` value (a `missing value` or an `undefined numerical result`).

`numpy.nan` is itself a value of type `float`, which is why a Series of integers that contains it is stored as `float64`.

__Pandas__ uses `np.nan` to represent __missing values in numeric data types__.


---

`np.nan` is __not equal to anything__, including itself. A missing value therefore never matches another value in an equality check, which also means `==` cannot be used to detect it (use `pd.isna()` or `np.isnan()` instead):

In [3]:
print(np.nan == np.nan)  # Output: False

False
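Because `==` never matches `np.nan`, missing values need a dedicated check. The sketch below contrasts the two approaches on a Series like the one created earlier; `s_missing`, `eq_mask` and `na_mask` are illustrative names, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# s_missing is a hypothetical name; the data mirrors the Series created above.
s_missing = pd.Series([1, 2, 3, np.nan, 5, 6])

# Comparing with == never matches NaN, so this mask contains no True values.
eq_mask = s_missing == np.nan

# Series.isna() (or pd.isna()) is the reliable way to find missing values.
na_mask = s_missing.isna()

print(eq_mask.sum())     # number of matches found by ==
print(na_mask.sum())     # number of missing values found by isna()
print(s_missing.dtype)   # the NaN forces the integers to be stored as floats
```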


---

`Creating a DataFrame` by passing a NumPy array with a datetime index using `pd.date_range()` and labeled columns:

In [4]:
# default frequency - days  

pd.date_range('2013-01-01', periods=5)

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05'],
              dtype='datetime64[ns]', freq='D')

In [5]:
# If you want the first day of each month, use the frequency 'MS' (Month Start):

pd.date_range('2013-01-01', periods=5, freq='MS')

DatetimeIndex(['2013-01-01', '2013-02-01', '2013-03-01', '2013-04-01',
               '2013-05-01'],
              dtype='datetime64[ns]', freq='MS')

In [6]:
# 'ME' stands for Month End frequency (alias added in pandas 2.2; older versions use 'M').

pd.date_range('2013-01-01', periods=5, freq='ME')

DatetimeIndex(['2013-01-31', '2013-02-28', '2013-03-31', '2013-04-30',
               '2013-05-31'],
              dtype='datetime64[ns]', freq='ME')

In [7]:
[i.day for i in pd.date_range(start='2025-01-01', periods=12, freq='ME')]

[31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]

---

`np.random.randn()`
- Returns a sample (or samples) from the `"standard normal" distribution`.
- The values are random floats drawn from `a univariate "normal" (Gaussian) distribution of mean 0 and variance 1`; the arguments give the shape of the returned array.
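A quick way to check this description is to draw many samples and look at their mean and standard deviation. This is only a sketch: the seed and the sample size are arbitrary choices, and `samples` and `grid` are illustrative names.

```python
import numpy as np

np.random.seed(0)                    # arbitrary seed, only for repeatability
samples = np.random.randn(100_000)   # 1-D array of standard-normal draws
grid = np.random.randn(6, 4)         # the arguments give the output shape

print(grid.shape)
# With this many draws the sample mean is close to 0 and the std close to 1.
print(round(samples.mean(), 2), round(samples.std(), 2))
```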

In [3]:
list('ABCD')

['A', 'B', 'C', 'D']

In [4]:
df = pd.DataFrame(np.random.randn(6, 4), 
                  index=pd.date_range('20130101', periods=6),
                  columns=list("ABCD"))

In [5]:
df

Unnamed: 0,A,B,C,D
2013-01-01,-1.474355,0.531672,0.025491,-0.302742
2013-01-02,-0.726984,-0.746165,0.702313,0.803666
2013-01-03,-0.454112,-0.225644,-0.211412,-0.265718
2013-01-04,-0.136119,-0.401121,-2.291708,-1.362184
2013-01-05,-1.690095,1.01561,0.173992,0.105964
2013-01-06,0.768235,-0.3077,-2.00575,-0.492335


In [11]:
df.loc['2013-01-01']

A    0.709946
B    1.478118
C    0.643158
D   -1.004364
Name: 2013-01-01 00:00:00, dtype: float64

In [12]:
type(df.loc['2013-01-01'])

pandas.core.series.Series

In [13]:
df.iloc[0]

A    0.709946
B    1.478118
C    0.643158
D   -1.004364
Name: 2013-01-01 00:00:00, dtype: float64

---

`Creating a DataFrame` by passing `a dictionary` of objects where the keys are the column labels and the values are the column values.



In [8]:
df2 = pd.DataFrame(
    {
        'A': 1.0,
        'B': pd.Timestamp('2013-01-01'),
        'C': pd.Series(1, index=list(range(4)), dtype='float32'),
        'D': np.array([3] * 4, dtype='int32'),
        'E': pd.Categorical(['test', 'train', 'test', 'train']),
        'F': 'foo'
    }
)

In [9]:
df2

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-01,1.0,3,test,foo
1,1.0,2013-01-01,1.0,3,train,foo
2,1.0,2013-01-01,1.0,3,test,foo
3,1.0,2013-01-01,1.0,3,train,foo


In [10]:
np.array([3] * 4)

array([3, 3, 3, 3])

In [11]:
[3] * 4  # multiplying a list repeats its elements; no arithmetic is performed

[3, 3, 3, 3]

In [12]:
np.array([3]) * 4

array([12])

---

In [13]:
d = pd.Timestamp('2025-01-01')

In [14]:
type(d)

pandas._libs.tslibs.timestamps.Timestamp

In [15]:
d_series = pd.Series([pd.Timestamp('2025-07-07')] * 4)

In [16]:
d_series

0   2025-07-07
1   2025-07-07
2   2025-07-07
3   2025-07-07
dtype: datetime64[ns]

In [17]:
d_series.dtype

dtype('<M8[ns]')

---

`<M8[ns]` is the NumPy internal representation of `datetime64[ns]`.
<br> It indicates datetime data stored with `nanosecond precision`.

### Key Points

There is `NO` specific `dtype` called `timestamp` in pandas or NumPy. However:

- In pandas, date-time-related data is often represented as `datetime64` types under the hood, even though individual values may appear as `pandas.Timestamp` objects.
- In NumPy, date-time data is stored using `datetime64` dtypes (e.g., `datetime64[s]`, `datetime64[ms]`, etc.).
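The difference between the column-level dtype and the type of a single element can be checked directly. The following is a small sketch with made-up dates; `ts_series` and `first` are illustrative names.

```python
import numpy as np
import pandas as pd

# ts_series is a hypothetical name used only in this sketch.
ts_series = pd.Series(pd.to_datetime(["2025-07-07", "2025-07-08"]))

# The Series as a whole has a NumPy datetime64 dtype ...
print(ts_series.dtype)
print(np.issubdtype(ts_series.dtype, np.datetime64))

# ... while a single element comes back as a pandas.Timestamp object.
first = ts_series.iloc[0]
print(type(first).__name__)
```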


---

In [18]:
df2

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-01,1.0,3,test,foo
1,1.0,2013-01-01,1.0,3,train,foo
2,1.0,2013-01-01,1.0,3,test,foo
3,1.0,2013-01-01,1.0,3,train,foo


In [19]:
df2.dtypes

A          float64
B    datetime64[s]
C          float32
D            int32
E         category
F           object
dtype: object

---

In [20]:
pd.Series([1, 'a', 4])

0    1
1    a
2    4
dtype: object

---

## Viewing data

In [24]:
# the top and bottom rows

In [25]:
df.head()

Unnamed: 0,A,B,C,D
2013-01-01,-1.474355,0.531672,0.025491,-0.302742
2013-01-02,-0.726984,-0.746165,0.702313,0.803666
2013-01-03,-0.454112,-0.225644,-0.211412,-0.265718
2013-01-04,-0.136119,-0.401121,-2.291708,-1.362184
2013-01-05,-1.690095,1.01561,0.173992,0.105964


In [26]:
df.tail(2)

Unnamed: 0,A,B,C,D
2013-01-05,-1.690095,1.01561,0.173992,0.105964
2013-01-06,0.768235,-0.3077,-2.00575,-0.492335


---

In [27]:
df.index

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [28]:
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

---

In [29]:
# Return a NumPy representation of the underlying data:

df.to_numpy()

array([[-1.47435522,  0.53167239,  0.02549119, -0.30274197],
       [-0.72698401, -0.74616467,  0.70231261,  0.80366648],
       [-0.45411242, -0.22564356, -0.21141234, -0.26571822],
       [-0.13611902, -0.40112063, -2.29170839, -1.36218393],
       [-1.69009469,  1.01560992,  0.17399187,  0.10596362],
       [ 0.76823538, -0.30770033, -2.00574956, -0.49233502]])

#### Note
__NumPy arrays have one dtype for the entire array while pandas DataFrames have one dtype per column__. When you call `DataFrame.to_numpy()`, pandas will find the NumPy dtype that can hold _all_ of the dtypes in the DataFrame. If the common data type is `object`, `DataFrame.to_numpy()` will require copying data.
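To make the "common dtype" rule concrete, the sketch below converts two small frames. The frames `numeric` and `mixed` are invented for this example.

```python
import numpy as np
import pandas as pd

# All columns numeric: int64 and float64 share float64 as a common dtype.
# (numeric and mixed are hypothetical frames used only in this sketch.)
numeric = pd.DataFrame({"i": [1, 2], "f": [0.5, 1.5]})
print(numeric.to_numpy().dtype)

# A numeric column next to a column of Python strings:
# only the object dtype can hold both kinds of values.
mixed = pd.DataFrame({"i": [1, 2], "s": pd.Series(["x", "y"], dtype=object)})
print(mixed.to_numpy().dtype)
```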



In [30]:
df2

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-01,1.0,3,test,foo
1,1.0,2013-01-01,1.0,3,train,foo
2,1.0,2013-01-01,1.0,3,test,foo
3,1.0,2013-01-01,1.0,3,train,foo


In [30]:
df2.to_numpy()

array([[1.0, Timestamp('2013-01-01 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-01 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2013-01-01 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-01 00:00:00'), 1.0, 3, 'train', 'foo']],
      dtype=object)

---

#### In pandas
<u>`object` __dtype__ </u>

- The `object` __dtype__ is a pandas (and NumPy) concept, not a general Python one. A column or Series with this dtype stores references to arbitrary Python objects, which lets pandas hold mixed data types in one column.
<br>The concept of a "dtype" like `object` does not exist in core Python.

- In native Python, data types are handled directly as Python classes (e.g., `int`, `float`, `str`, etc.), without an abstraction layer like pandas' dtypes.

---

#### In Core Python:
- `Strings` in Python are instances of the `str` class, and their type is `str`.
- Example:

In [54]:
s = "Hello, World!"
print(type(s))  

<class 'str'>


In Python's pandas library, the `object` data type is a flexible data type used primarily for storing __text (strings) or mixed types of data__ in a `DataFrame` or `Series`. It is the most general data type in pandas and can store any Python object, including:

- Strings (the most common use case)
- Mixed types (e.g., numbers, strings, and other objects in the same column)
- Custom Python objects


#### Characteristics of `object` dtype:
<br>


- __Flexibility__: It can hold any type of Python object, making it highly versatile.
- __Performance__: Operations on `object` dtype are generally __slower__ than on specialized data types like `int`, `float`, or `category`, because pandas has to handle the values one Python object at a time instead of using type-specific optimized routines.
- __Memory Usage__: `object` dtype columns use more memory because they store pointers to objects in memory, rather than directly storing the data.
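The memory point can be checked with `Series.memory_usage(deep=True)`. The sketch below compares an `object` Series with its `category` equivalent; the data and the names `fruits` and `as_category` are made up, and the exact byte counts depend on the platform and pandas version.

```python
import pandas as pd

# 30,000 strings drawn from only three distinct values.
# dtype=object is requested explicitly so the comparison is about object dtype.
fruits = pd.Series(["apple", "banana", "orange"] * 10_000, dtype=object)
as_category = fruits.astype("category")

# deep=True also counts the Python string objects, not just the pointers.
object_bytes = fruits.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(object_bytes, category_bytes)
print(category_bytes < object_bytes)
```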

In [62]:
a = pd.Series(['apple', 'banana', 'orange']) 

In [63]:
a.dtype

dtype('O')

In [64]:
print(a.dtype)

object


In [50]:
# with mixed types

b = pd.Series(['a', 1, 5.0, True])

In [51]:
print(b.dtype)

object


---

#### When to Avoid `object` dtype:


<br>

- If you're dealing with text data and there are only a few unique values, consider converting the column to the `category` dtype for better performance and memory efficiency.
- For numeric data, use `int` or `float` types for faster computations.
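For the numeric case, numbers that were read in as text can be converted with `pd.to_numeric()`. This is a minimal sketch on invented data; `raw` and `numbers` are illustrative names.

```python
import pandas as pd

# Numbers that arrived as text end up in an object column.
# (raw and numbers are hypothetical names used only in this sketch.)
raw = pd.Series(["1", "2", "3"], dtype=object)
numbers = pd.to_numeric(raw)      # parse the strings into a numeric dtype

print(raw.dtype, numbers.dtype)
print(numbers.sum())              # arithmetic now operates on real integers
```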


In [34]:
# Store text data with the category dtype

data = pd.Series(['apple', 'banana', 'orange'], dtype='category')

In [38]:
data

0     apple
1    banana
2    orange
dtype: category
Categories (3, object): ['apple', 'banana', 'orange']

In [35]:
print(data.dtype)

category


In [37]:
print(data.cat.categories)

Index(['apple', 'banana', 'orange'], dtype='object')


In [39]:
len(data)

3

In [40]:
# let's add a new value that isn't in the existing categories 
data[3] = 'pear'

In [41]:
data

0     apple
1    banana
2    orange
3      pear
dtype: object

### Important


1. If the new value is not one of the existing categories, pandas silently __converts__ the `Series` to `object` dtype on assignment.
2. To preserve the categorical dtype, you must explicitly add the new category first using `.cat.add_categories(new_categories)`.


In [42]:
data = pd.Series(['apple', 'banana', 'apple'], dtype='category')

In [43]:
data

0     apple
1    banana
2     apple
dtype: category
Categories (2, object): ['apple', 'banana']

In [46]:
data = data.cat.add_categories('pear')

In [47]:
data

0     apple
1    banana
2     apple
dtype: category
Categories (3, object): ['apple', 'banana', 'pear']

In [48]:
data[3] = 'pear'

In [49]:
data

0     apple
1    banana
2     apple
3      pear
dtype: category
Categories (3, object): ['apple', 'banana', 'pear']

---

### In pandas:
- __Strings__ in a pandas `Series` or `DataFrame` can be stored as:

    1. `object` dtype: When pandas does not optimize specifically for strings, it stores them as generic Python objects (of type `str`).
    2. `string` dtype: A specialized dtype (introduced in pandas 1.0) for string storage and operations, offering better consistency and functionality.
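As a short sketch of what the dedicated dtype offers, the example below requests it at creation time and includes a missing entry. The name `words` and its values are illustrative only.

```python
import pandas as pd

# words is a hypothetical name; dtype="string" requests the dedicated dtype.
words = pd.Series(["apple", "banana", None], dtype="string")

print(words.dtype)            # the dedicated string dtype, not object
print(words.str.upper())      # .str methods work element-wise
print(words.isna().sum())     # the missing entry is tracked as a missing value
```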

In [55]:
# Example 
a = pd.Series(['apple', 'banana', 'orange'])

In [57]:
a.dtype

dtype('O')

In [44]:
df2['F']

0    foo
1    foo
2    foo
3    foo
Name: F, dtype: object

In [45]:
df2['F'] = df2['F'].astype('string')

In [46]:
df2['F']

0    foo
1    foo
2    foo
3    foo
Name: F, dtype: string

---

- Use `.dtype` for a single column or Series.
- Use `.dtypes` for an overview of all column data types in a DataFrame.







In [58]:
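data = {'A': [1, 2, 3], 'B': ['x', 'y', 'z']}
print(id(data))            # identity of the dict
data = pd.DataFrame(data)  # rebinds the name to a new DataFrame object
print(id(data))            # a different identity: the dict was not modified in place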


2582709371136
2582674572976


In [59]:
data['A'].dtype

dtype('int64')

In [60]:
data['B'].dtype

dtype('O')

In [61]:
data.dtypes

A     int64
B    object
dtype: object

---

`.describe()`

Generate _descriptive statistics_ (excluding NaN values).
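The "excluding NaN values" part is easy to verify on a tiny Series. This is a sketch on invented numbers; `values` and `summary` are illustrative names.

```python
import numpy as np
import pandas as pd

# values is a hypothetical Series with one missing entry.
values = pd.Series([1.0, 2.0, np.nan, 4.0])
summary = values.describe()

print(len(values))        # 4 entries, one of them missing
print(summary["count"])   # count covers only the non-missing entries
print(summary["mean"])    # mean of 1.0, 2.0 and 4.0
```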



In [66]:
df

Unnamed: 0,A,B,C,D
2013-01-01,0.604343,-1.458726,-0.026405,-0.898501
2013-01-02,-0.833029,1.575948,0.593999,0.074502
2013-01-03,0.521692,-0.17657,-0.06933,0.02197
2013-01-04,0.190665,-0.81656,0.104098,1.144434
2013-01-05,1.174129,-0.637501,-1.734045,0.663057
2013-01-06,0.003405,-1.347315,-0.143134,-0.191517


In [67]:
df.describe()

Unnamed: 0,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,0.276868,-0.476787,-0.212469,0.135657
std,0.676216,1.110599,0.79066,0.705545
min,-0.833029,-1.458726,-1.734045,-0.898501
25%,0.05022,-1.214626,-0.124683,-0.138145
50%,0.356178,-0.727031,-0.047868,0.048236
75%,0.583681,-0.291803,0.071473,0.515918
max,1.174129,1.575948,0.593999,1.144434


---

`Transposing your data`

In [69]:
df

Unnamed: 0,A,B,C,D
2013-01-01,0.604343,-1.458726,-0.026405,-0.898501
2013-01-02,-0.833029,1.575948,0.593999,0.074502
2013-01-03,0.521692,-0.17657,-0.06933,0.02197
2013-01-04,0.190665,-0.81656,0.104098,1.144434
2013-01-05,1.174129,-0.637501,-1.734045,0.663057
2013-01-06,0.003405,-1.347315,-0.143134,-0.191517


In [70]:
df.T

Unnamed: 0,2013-01-01,2013-01-02,2013-01-03,2013-01-04,2013-01-05,2013-01-06
A,0.604343,-0.833029,0.521692,0.190665,1.174129,0.003405
B,-1.458726,1.575948,-0.17657,-0.81656,-0.637501,-1.347315
C,-0.026405,0.593999,-0.06933,0.104098,-1.734045,-0.143134
D,-0.898501,0.074502,0.02197,1.144434,0.663057,-0.191517


---

`.sort_index()`

`DataFrame.sort_index()` sorts by the labels along an axis (`axis=0` for the row index, `axis=1` for the column labels):




In [74]:
df.sort_index(axis=1, ascending=False)

Unnamed: 0,D,C,B,A
2013-01-01,-0.898501,-0.026405,-1.458726,0.604343
2013-01-02,0.074502,0.593999,1.575948,-0.833029
2013-01-03,0.02197,-0.06933,-0.17657,0.521692
2013-01-04,1.144434,0.104098,-0.81656,0.190665
2013-01-05,0.663057,-1.734045,-0.637501,1.174129
2013-01-06,-0.191517,-0.143134,-1.347315,0.003405


In [76]:
df.sort_index(axis=0, ascending=False)

Unnamed: 0,A,B,C,D
2013-01-06,0.003405,-1.347315,-0.143134,-0.191517
2013-01-05,1.174129,-0.637501,-1.734045,0.663057
2013-01-04,0.190665,-0.81656,0.104098,1.144434
2013-01-03,0.521692,-0.17657,-0.06933,0.02197
2013-01-02,-0.833029,1.575948,0.593999,0.074502
2013-01-01,0.604343,-1.458726,-0.026405,-0.898501


---

`.sort_values()`

`DataFrame.sort_values()` sorts rows by the values in one or more columns:



In [77]:
df.sort_values(by="B")

Unnamed: 0,A,B,C,D
2013-01-01,0.604343,-1.458726,-0.026405,-0.898501
2013-01-06,0.003405,-1.347315,-0.143134,-0.191517
2013-01-04,0.190665,-0.81656,0.104098,1.144434
2013-01-05,1.174129,-0.637501,-1.734045,0.663057
2013-01-03,0.521692,-0.17657,-0.06933,0.02197
2013-01-02,-0.833029,1.575948,0.593999,0.074502
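`sort_values()` also accepts a list of columns and a matching list of sort directions. The sketch below uses a small invented frame; `scores` and `ordered` are illustrative names.

```python
import pandas as pd

# scores is a hypothetical frame used only in this sketch.
scores = pd.DataFrame(
    {"team": ["b", "a", "b", "a"], "points": [3, 7, 9, 1]}
)

# Sort by team A-Z, then by points from highest to lowest within each team.
ordered = scores.sort_values(by=["team", "points"], ascending=[True, False])

print(ordered)
print(scores.index.tolist())  # the original frame is left unchanged
```

The result is a new DataFrame; the original keeps its row order unless `inplace=True` is passed or the result is assigned back.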


---

## SELECTION

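pandas data access methods (indexers, used with square brackets rather than called like functions):
- `DataFrame.at[]`: a single value, by row and column label
- `DataFrame.iat[]`: a single value, by integer position
- `DataFrame.loc[]`: rows and columns, by label
- `DataFrame.iloc[]`: rows and columns, by integer position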
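The four accessors can be compared side by side on a small frame. This is a sketch with invented data; `frame` and `day` are illustrative names, and the frame only imitates the layout of `df`.

```python
import numpy as np
import pandas as pd

# frame is a hypothetical 3x4 frame holding the integers 0..11.
frame = pd.DataFrame(
    np.arange(12).reshape(3, 4),
    index=pd.date_range("2013-01-01", periods=3),
    columns=list("ABCD"),
)

day = pd.Timestamp("2013-01-02")   # label of the second row

print(frame.at[day, "B"])            # single value, by row and column label
print(frame.iat[1, 1])               # the same value, by integer position
print(frame.loc[day, ["A", "B"]])    # several values, selected by label
print(frame.iloc[1, 0:2])            # several values, selected by position
```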

## Getitem ([])

For a `DataFrame`, passing __a single label__ selects a column and yields a `Series`, equivalent to `df.A`:

In [79]:
df['A']

2013-01-01    0.604343
2013-01-02   -0.833029
2013-01-03    0.521692
2013-01-04    0.190665
2013-01-05    1.174129
2013-01-06    0.003405
Freq: D, Name: A, dtype: float64

---

For a `DataFrame`, passing a slice `:` selects matching `rows`:



In [80]:
df[0:3]

Unnamed: 0,A,B,C,D
2013-01-01,0.604343,-1.458726,-0.026405,-0.898501
2013-01-02,-0.833029,1.575948,0.593999,0.074502
2013-01-03,0.521692,-0.17657,-0.06933,0.02197


Because the __index__ is a `DatetimeIndex`, rows can also be sliced by date labels:

In [86]:
df['20130102':'20130104'] # including both ends

Unnamed: 0,A,B,C,D
2013-01-02,-0.833029,1.575948,0.593999,0.074502
2013-01-03,0.521692,-0.17657,-0.06933,0.02197
2013-01-04,0.190665,-0.81656,0.104098,1.144434


Slicing `df[0:3]` selects the first 3 rows `by position (0-based indexing)`, not by the labels, and excludes the stop position. It is functionally equivalent to `df.iloc[0:3]`.
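The two slicing behaviours can be checked against `.iloc` and `.loc` directly. The sketch below rebuilds a frame with the same index layout as `df`; `dated`, `by_position` and `by_label` are illustrative names and the values are invented.

```python
import numpy as np
import pandas as pd

# dated is a hypothetical frame with the same date index layout as df.
dated = pd.DataFrame(
    np.arange(24).reshape(6, 4),
    index=pd.date_range("2013-01-01", periods=6),
    columns=list("ABCD"),
)

by_position = dated[0:3]                    # positions 0, 1, 2; stop excluded
by_label = dated["20130102":"20130104"]     # both end labels included

print(by_position.equals(dated.iloc[0:3]))
print(by_label.equals(dated.loc["20130102":"20130104"]))
print(len(by_position), len(by_label))
```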

