# Series
Start your Notebook and import the required libraries:

In [3]:
import numpy as np
import pandas as pd

`Series` is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic way to create a `Series` is to call:
```python
s = pd.Series(data, index=index)
```
`data` can be many different things:
* Python dict
* an ndarray
* a scalar value

The passed index is a list of axis labels.

## From ndarray
If `data` is an ndarray, `index` must be the same length as `data`. If no index is passed, one will be created with the values `[0, ..., len(data) - 1]`.

In [4]:
# Here, we specify the index 
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

In [5]:
s

a   -0.636039
b    1.983223
c    2.394180
d   -1.567471
e   -2.087615
dtype: float64

In [6]:
s.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [7]:
# Here, we let Pandas create a default index
pd.Series(np.random.randn(5))

0    1.549223
1   -1.204702
2   -2.112614
3    1.292282
4   -1.469574
dtype: float64

Pandas supports non-unique index values. If an operation that does not support duplicate index values is attempted, an exception will be raised at that time.
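
As a minimal sketch of what non-unique labels look like in practice (the labels and values here are made up for illustration): selecting a duplicated label returns every match, while `reindex` is one operation that refuses duplicate labels and raises only when it is called.

```python
import pandas as pd

# Hypothetical data: the label 'a' appears twice
dup = pd.Series([1, 2, 3], index=['a', 'a', 'b'])

print(dup.index.is_unique)   # the index contains duplicates
print(dup['a'])              # selecting 'a' returns both matching rows

# reindex needs unique labels, so the exception is raised at this point
try:
    dup.reindex(['a', 'b'])
    reindex_failed = False
except ValueError as err:
    reindex_failed = True
    print('reindex failed:', err)
```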

## From dict
Series can be created from dicts:

In [8]:
d = {'b': 1, 'a': 0, 'c': 2}
pd.Series(d)

b    1
a    0
c    2
dtype: int64

When data is a dict, and an index is not passed, the Series index will be ordered by the dict’s insertion order. There is no sorting if you have Python version >= 3.6 and Pandas version >= 0.23.
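
If an index is passed together with a dict, the values are looked up by label instead. The sketch below reuses the dict from above and asks for a label `'d'` that is not one of its keys, so that entry should come back as `NaN`.

```python
import pandas as pd

d = {'b': 1, 'a': 0, 'c': 2}

# Values are pulled out by label; 'd' is not a key in the dict, so it becomes NaN
picked = pd.Series(d, index=['b', 'c', 'd', 'a'])
print(picked)
```

Because `NaN` is a floating point value, the resulting Series is stored as `float64` rather than `int64`.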

## From scalar value
If data is a scalar value, an index must be provided. The value will be repeated to match the length of the index.

In [9]:
pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])

a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64

## Series is `ndarray`-like
Series acts very similarly to an ndarray from NumPy and is a valid argument to most NumPy functions. Operations such as slicing also slice the index.

In [10]:
# Select by position; plain s[0] on a labeled index is deprecated in recent Pandas versions
s.iloc[0]

-0.6360389901678979

In [11]:
s[:3]

a   -0.636039
b    1.983223
c    2.394180
dtype: float64

In [11]:
s[s > s.median()]

b    1.983223
c    2.394180
dtype: float64

In [12]:
s.iloc[[4, 3, 1]]

e   -2.087615
d   -1.567471
b    1.983223
dtype: float64

In [12]:
np.exp(s)

a     0.529385
b     7.266123
c    10.959211
d     0.208572
e     0.123982
dtype: float64

We will address indexing and slicing in the following tutorials.

Each Series has a `dtype`.

In [13]:
s.dtype

dtype('float64')

While `Series` is `ndarray`-like, if you need an actual `ndarray`, use `Series.to_numpy()`:

In [14]:
s.to_numpy()

array([ 1.05473515, -1.25385523,  0.2080363 , -1.11897653,  1.8705915 ])
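
To make the difference concrete, here is a small sketch with made-up values (the name `t` is only for illustration). The array returned by `to_numpy()` carries the values but not the labels.

```python
import numpy as np
import pandas as pd

# Hypothetical values chosen so the result is easy to check by eye
t = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])

arr = t.to_numpy()
print(type(arr))   # a plain NumPy array
print(arr)         # the values only; the labels 'a', 'b', 'c' are gone
print(arr[0])      # positions are the only way to index the array
```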

## Series is `dict`-like
A `Series` is like a fixed-size `dict` in which you can get and set values by an index label.

In [13]:
s['a']

-0.6360389901678979

In [14]:
s['e'] = 12

In [15]:
s

a    -0.636039
b     1.983223
c     2.394180
d    -1.567471
e    12.000000
dtype: float64

In [16]:
'e' in s

True

In [17]:
'f' in s

False

If a label is not contained, an exception is raised:

In [18]:
s['f']

KeyError: 'f'
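
If you would rather not handle the exception, `Series.get()` behaves like `dict.get()`: a missing label gives `None` or a default that you supply. A short sketch with made-up data:

```python
import numpy as np
import pandas as pd

s_demo = pd.Series([1.0, 2.0], index=['a', 'b'])

missing = s_demo.get('f')            # returns None instead of raising
fallback = s_demo.get('f', np.nan)   # or a default of your choice
present = s_demo.get('a')            # existing labels work as usual

print(missing, fallback, present)
```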

## Vectorized operations
When working with raw NumPy arrays, looping through value-by-value is usually not necessary. The same is true when working with Series in Pandas. Series can also be passed into most NumPy methods expecting an ndarray.

In [19]:
s + s

a    -1.272078
b     3.966446
c     4.788361
d    -3.134941
e    24.000000
dtype: float64

In [20]:
s * 2

a    -1.272078
b     3.966446
c     4.788361
d    -3.134941
e    24.000000
dtype: float64

In [21]:
np.exp(s)

a         0.529385
b         7.266123
c        10.959211
d         0.208572
e    162754.791419
dtype: float64

A key difference between Series and ndarray is that operations between Series automatically align data based on the label. Thus, you can write computations without considering whether the Series involved have the same labels.

In [24]:
s1 = s[1:]

In [25]:
s2 = s[:-1]

In [26]:
s1 + s2

a         NaN
b    3.966446
c    4.788361
d   -3.134941
e         NaN
dtype: float64

The result of an operation between unaligned Series will have the union of the indexes involved. If a label is not found in one Series or the other, the result for that label will be marked as missing (`NaN`).
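
The same behavior can be reproduced with fixed values, which makes the alignment easier to follow. The sketch below uses two made-up Series, `x` and `y`, that share only the labels `'b'` and `'c'`. It also shows `Series.add()` with `fill_value`, one way to treat a missing partner as 0 instead of getting `NaN`.

```python
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
y = pd.Series([10.0, 20.0, 30.0], index=['b', 'c', 'd'])

total = x + y                     # 'a' and 'd' have no partner, so they become NaN
print(total)

filled = x.add(y, fill_value=0)   # a missing partner is treated as 0
print(filled)

print(total.dropna())             # or simply drop the unmatched labels
```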

## Name attribute
Series can also have a name attribute.

In [22]:
s = pd.Series(np.random.randn(5), name='something')

In [23]:
s

0    0.749557
1   -1.159844
2   -0.929154
3    0.738164
4   -0.000674
Name: something, dtype: float64

In [29]:
s.name

'something'
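
A Series can be given a different name with `Series.rename()`. The sketch below uses made-up values; `rename()` returns a new Series and leaves the original untouched.

```python
import pandas as pd

named = pd.Series([1, 2, 3], name='something')
renamed = named.rename('different')   # returns a new Series

print(named.name)     # the original keeps its name
print(renamed.name)
```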

# DataFrame
`DataFrame` is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. 
It is generally the most commonly used Pandas object. Like Series, DataFrame accepts many different kinds of input:
* dict of 1D ndarrays, lists, dicts, or Series
* 2D NumPy ndarray
* Series
* DataFrame

## From dict of Series or dicts
The resulting index will be the union of the indexes of the various Series.

In [31]:
d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

In [32]:
df = pd.DataFrame(d)

In [33]:
df

|   | one | two |
|---|-----|-----|
| a | 1.0 | 1.0 |
| b | 2.0 | 2.0 |
| c | 3.0 | 3.0 |
| d | NaN | 4.0 |


In [34]:
pd.DataFrame(d, index=['d', 'b', 'a'])

|   | one | two |
|---|-----|-----|
| d | NaN | 4.0 |
| b | 2.0 | 2.0 |
| a | 1.0 | 1.0 |


In [35]:
pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])

|   | two | three |
|---|-----|-------|
| d | 4.0 | NaN |
| b | 2.0 | NaN |
| a | 1.0 | NaN |


When data is a dict and `columns` is not passed, the DataFrame columns will be ordered by the dict’s insertion order. There is no sorting if you have Python version >= 3.6 and Pandas version >= 0.23.

If you pass an index and/or columns, you are guaranteeing the index and/or columns of the resulting DataFrame. Thus, a dict of Series plus a specific index and/or columns will discard all data not matching the passed index and/or columns. See the last example above, with an empty column labeled `three`.
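
The examples above use a dict of Series. A dict of dicts works in a similar way, as the following sketch with made-up values suggests: the outer keys become columns, the inner keys become row labels, and combinations that are absent come out as `NaN`.

```python
import pandas as pd

# Hypothetical nested dict: outer keys become columns, inner keys become row labels
nested = {'one': {'a': 1.0, 'b': 2.0},
          'two': {'a': 10.0, 'c': 30.0}}

df_nested = pd.DataFrame(nested)
print(df_nested)
```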

## From dict of ndarrays / lists
The ndarrays must all be of the same length. If an index is passed, it must obviously also be of the same length as the arrays. If no index is passed, the result will be range(n), where n is the array length.

In [36]:
d = {'one': [1., 2., 3., 4.],
     'two': [4., 3., 2., 1.]}
pd.DataFrame(d)

|   | one | two |
|---|-----|-----|
| 0 | 1.0 | 4.0 |
| 1 | 2.0 | 3.0 |
| 2 | 3.0 | 2.0 |
| 3 | 4.0 | 1.0 |


In [37]:
pd.DataFrame(d, index=['a', 'b', 'c', 'd'])

|   | one | two |
|---|-----|-----|
| a | 1.0 | 4.0 |
| b | 2.0 | 3.0 |
| c | 3.0 | 2.0 |
| d | 4.0 | 1.0 |
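
The length requirements are enforced at construction time. The sketch below uses made-up inputs to show both failure cases: lists of unequal length, and an index that is longer than the lists.

```python
import pandas as pd

# Hypothetical columns of unequal length
bad = {'one': [1., 2., 3., 4.],
       'two': [4., 3., 2.]}

try:
    pd.DataFrame(bad)
    unequal_failed = False
except ValueError as err:
    unequal_failed = True
    print('unequal lengths:', err)

# Lists of length 2 combined with an index of length 3
try:
    pd.DataFrame({'one': [1., 2.]}, index=['a', 'b', 'c'])
    index_failed = False
except ValueError as err:
    index_failed = True
    print('index mismatch:', err)
```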


## From a Series
The result will be a DataFrame with the same index as the input Series, and with one column whose name is the original name of the Series (only if no other column name is provided).

In [38]:
pd.DataFrame(pd.Series(np.random.randn(5), name='something'))

|   | something |
|---|-----------|
| 0 | 0.208107 |
| 1 | 0.592532 |
| 2 | -0.729830 |
| 3 | -0.905216 |
| 4 | 1.646468 |
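
`Series.to_frame()` is another way to do the same conversion, and it makes the column naming rule explicit. In this sketch (made-up values) the column takes the name of the Series unless a different one is passed.

```python
import pandas as pd

ser = pd.Series([1.5, 2.5], index=['x', 'y'], name='something')

as_frame = ser.to_frame()               # the column takes the Series name
other = ser.to_frame(name='renamed')    # or an explicitly provided name

print(as_frame)
print(other)
```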


## Column selection, addition, deletion
You can treat a DataFrame semantically like a dict of like-indexed Series objects. Getting, setting, and deleting columns work with the same syntax as the analogous dict operations:

In [39]:
df['one']

a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64

In [40]:
df['three'] = df['one'] * df['two']

In [41]:
df['flag'] = df['one'] > 2
df

|   | one | two | three | flag |
|---|-----|-----|-------|-------|
| a | 1.0 | 1.0 | 1.0 | False |
| b | 2.0 | 2.0 | 4.0 | False |
| c | 3.0 | 3.0 | 9.0 | True |
| d | NaN | 4.0 | NaN | False |


Columns can be deleted as with a dict:

In [42]:
del df['two']
df

|   | one | three | flag |
|---|-----|-------|-------|
| a | 1.0 | 1.0 | False |
| b | 2.0 | 4.0 | False |
| c | 3.0 | 9.0 | True |
| d | NaN | NaN | False |


When inserting a scalar value, it will naturally be propagated to fill the column:

In [43]:
df['foo'] = 'bar'
df

|   | one | three | flag | foo |
|---|-----|-------|-------|-----|
| a | 1.0 | 1.0 | False | bar |
| b | 2.0 | 4.0 | False | bar |
| c | 3.0 | 9.0 | True | bar |
| d | NaN | NaN | False | bar |


When inserting a Series that does not have the same index as the DataFrame, it will be conformed to the DataFrame’s index.

In [44]:
df['one_trunc'] = df['one'][:2]

In [45]:
df

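|   | one | three | flag | foo | one_trunc |
|---|-----|-------|-------|-----|-----------|
| a | 1.0 | 1.0 | False | bar | 1.0 |
| b | 2.0 | 4.0 | False | bar | 2.0 |
| c | 3.0 | 9.0 | True | bar | NaN |
| d | NaN | NaN | False | bar | NaN |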
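
By default, a new column is appended at the end. As a sketch with a small made-up frame, `DataFrame.insert()` places a column at a chosen position instead, and `DataFrame.pop()` removes a column and returns it, much like `dict.pop()`.

```python
import pandas as pd

frame = pd.DataFrame({'one': [1., 2., 3.], 'two': [4., 5., 6.]})

# Put the new column 'bar' at position 1, between 'one' and 'two'
frame.insert(1, 'bar', frame['one'] * 10)
print(list(frame.columns))

# pop() removes the column and hands it back as a Series
popped = frame.pop('two')
print(list(frame.columns))
print(popped)
```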


Operations with scalars are just as you would expect:

In [47]:
df = pd.DataFrame(np.random.randn(8, 3), columns=list('ABC'))

In [48]:
df * 5 + 2

|   | A | B | C |
|---|---|---|---|
| 0 | 0.701175 | 2.782983 | -0.160568 |
| 1 | 4.199940 | 0.456765 | 0.506576 |
| 2 | 7.727698 | -1.066128 | -4.762556 |
| 3 | -2.374685 | 4.653812 | -1.891830 |
| 4 | 5.278341 | 3.484398 | 0.288415 |
| 5 | 2.236978 | -2.684841 | 6.370091 |
| 6 | -5.700644 | 7.234736 | -6.332195 |
| 7 | 6.443639 | -4.253212 | 4.840517 |


In [49]:
1 / df

|   | A | B | C |
|---|---|---|---|
| 0 | -3.849632 | 6.385838 | -2.314206 |
| 1 | 2.272789 | -3.239947 | -3.348012 |
| 2 | 0.872951 | -1.630721 | -0.739365 |
| 3 | -1.142939 | 1.884082 | -1.284743 |
| 4 | 1.525161 | 3.368370 | -2.921269 |
| 5 | 21.098979 | -1.067272 | 1.144141 |
| 6 | -0.649296 | 0.955158 | -0.600082 |
| 7 | 1.125204 | -0.799589 | 1.760243 |


In [50]:
df ** 4

|   | A | B | C |
|---|---|---|---|
| 0 | 0.004553 | 0.000601 | 0.034865 |
| 1 | 0.037477 | 0.009075 | 0.007959 |
| 2 | 1.722029 | 0.141410 | 3.346290 |
| 3 | 0.586013 | 0.079360 | 0.367059 |
| 4 | 0.184815 | 0.007768 | 0.013731 |
| 5 | 0.000005 | 0.770725 | 0.583555 |
| 6 | 5.626369 | 1.201432 | 7.711833 |
| 7 | 0.623843 | 2.446429 | 0.104162 |


Boolean operators are vectorized as well:

In [51]:
df1 = pd.DataFrame({'a': [1, 0, 1], 'b': [0, 1, 1]}, dtype=bool)
df2 = pd.DataFrame({'a': [0, 1, 1], 'b': [1, 1, 0]}, dtype=bool)

In [52]:
df1 & df2

|   | a | b |
|---|-------|-------|
| 0 | False | False |
| 1 | False | True |
| 2 | True | False |


In [53]:
df1 | df2

|   | a | b |
|---|------|------|
| 0 | True | True |
| 1 | True | True |
| 2 | True | True |


In [54]:
df1 ^ df2

|   | a | b |
|---|-------|-------|
| 0 | True | True |
| 1 | True | False |
| 2 | False | True |


In [55]:
~df1

|   | a | b |
|---|-------|-------|
| 0 | False | True |
| 1 | True | False |
| 2 | False | False |


# `dtypes`
We can continue in the notebook from the previous tutorial. If you decide to create a new one, don't forget to import the packages.

For the most part, Pandas uses NumPy arrays and dtypes for Series or individual columns of a DataFrame. NumPy provides support for float, int, bool, timedelta64[ns] and datetime64[ns].

However, NumPy has no dedicated types for data such as categoricals, nullable integers, or strings with missing values, so Pandas extends NumPy's type system in a few places. The following table lists the most common Pandas extension types:

In [62]:
dft = pd.DataFrame({'Kind of Data': ['Categorical', 'nullable integer', 'Strings', 'Boolean (with NA)', 'any'],
                    'Data Type': ['CategoricalDtype', 'Int64Dtype', 'StringDtype', 'BooleanDtype', 'object dtype'],
                    # 'bool' is the alias of the plain NumPy boolean, so only 'boolean' is listed for BooleanDtype
                    'String Aliases': ["'category'", ["'Int8'", "'UInt8'", "'Int16'", "'UInt16...'"], "'string'", "'boolean'", "'object'"]
                    })

dft

|   | Kind of Data | Data Type | String Aliases |
|---|--------------|-----------|----------------|
| 0 | Categorical | CategoricalDtype | 'category' |
| 1 | nullable integer | Int64Dtype | ['Int8', 'UInt8', 'Int16', 'UInt16...'] |
| 2 | Strings | StringDtype | 'string' |
| 3 | Boolean (with NA) | BooleanDtype | 'boolean' |
| 4 | any | object dtype | 'object' |
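
The string aliases in the table can be passed directly as `dtype`. The following is a small sketch with made-up values showing three of the extension types; note that the nullable integer alias is capitalized (`'Int64'`), unlike NumPy's `'int64'`.

```python
import pandas as pd

ints = pd.Series([1, 2, None], dtype='Int64')             # nullable integer
cats = pd.Series(['x', 'y', 'x'], dtype='category')       # categorical
flags = pd.Series([True, None, False], dtype='boolean')   # boolean with NA

print(ints.dtype, cats.dtype, flags.dtype)
print(ints.isna().sum())           # the missing value is kept, and the rest stay integers
print(list(cats.cat.categories))   # the distinct categories
```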


The `dtypes` attribute of a DataFrame returns a Series with the data type of each column.

In [64]:
dft = pd.DataFrame({'A': np.random.rand(3),
                    'B': 1,
                    'C': 'foo',
                    'D': pd.Timestamp('20010102'),
                    'E': pd.Series([1.0] * 3).astype('float32'),
                    'F': False,
                    'G': pd.Series([1] * 3, dtype='int8')})

In [65]:
dft

|   | A | B | C | D | E | F | G |
|---|---|---|---|---|---|---|---|
| 0 | 0.557006 | 1 | foo | 2001-01-02 | 1.0 | False | 1 |
| 1 | 0.023064 | 1 | foo | 2001-01-02 | 1.0 | False | 1 |
| 2 | 0.644632 | 1 | foo | 2001-01-02 | 1.0 | False | 1 |


In [66]:
dft.dtypes

A           float64
B             int64
C            object
D    datetime64[ns]
E           float32
F              bool
G              int8
dtype: object

A Series has the corresponding `dtype` attribute:

In [67]:
dft['A'].dtype

dtype('float64')

Pandas has two ways of storing strings.

1. ```object dtype```, which can hold any Python object, including strings.
2. ```StringDtype```, which is dedicated to strings (introduced in Pandas 1.0.0, released in 2020).

It is recommended to use ```StringDtype``` for strings, because an ```object``` column can hold any mix of Python objects, not only strings.

In [68]:
# string data forces an ``object`` dtype
pd.Series([1, 2, 3, 6., 'foo'])

0      1
1      2
2      3
3    6.0
4    foo
dtype: object
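
To see what an `object` column can hide, and how the dedicated string type differs, consider the sketch below. The `mixed` Series repeats the example above, while `texts` is a made-up Series created with the `'string'` alias.

```python
import pandas as pd

mixed = pd.Series([1, 2, 3, 6., 'foo'])                   # falls back to object
texts = pd.Series(['foo', 'bar', None], dtype='string')   # dedicated string dtype

print(mixed.dtype)
print(mixed.map(type).tolist())   # the element types hidden inside the object column
print(texts.dtype)
print(texts.str.upper())          # string methods work; the missing value stays missing
```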

## Converting

You can use the ```astype()``` method to explicitly convert ```dtypes``` from one to another. By default it returns a copy, even if the ```dtype``` is unchanged (pass `copy=False` to change this behavior). It raises an exception if the ```astype()``` operation is invalid.

In [76]:
df1 = pd.DataFrame(np.random.randn(8, 1), columns=['A'], dtype='float32')
df1

|   | A |
|---|---|
| 0 | 0.290018 |
| 1 | -0.177295 |
| 2 | 0.604737 |
| 3 | -0.247475 |
| 4 | -2.011821 |
| 5 | 0.468091 |
| 6 | -0.347684 |
| 7 | -0.827332 |


Check the `dtypes` of `df1` before converting:

In [71]:
df1.dtypes 

A    float32
dtype: object

You can also call `.astype()` on a subset of columns, or on a single column (a `Series`).

To convert particular columns to specific dtypes, pass a dict to `astype()`.

In [79]:
# conversion of dtypes
df1 = df1.astype('float64')

In [80]:
df1.dtypes

A    float64
dtype: object

In [81]:
dft1 = pd.DataFrame({'a': [1, 0, 1], 'b': [4, 5, 6], 'c': [7, 8, 9]})
# np.bool was removed in NumPy 1.24; the built-in bool is the supported spelling
dft1 = dft1.astype({'a': bool, 'c': np.float64})
dft1

|   | a | b | c |
|---|-------|---|-----|
| 0 | True | 4 | 7.0 |
| 1 | False | 5 | 8.0 |
| 2 | True | 6 | 9.0 |


In [83]:
dft1.dtypes

a       bool
b      int64
c    float64
dtype: object
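
As mentioned at the start of this section, `astype()` raises an exception when a conversion is invalid. Here is a minimal sketch with a made-up Series of strings, one of which cannot be read as an integer:

```python
import pandas as pd

raw = pd.Series(['1', '2', 'x'])   # 'x' cannot be read as an integer

try:
    raw.astype('int64')
    converted = True
except ValueError as err:
    converted = False
    print('astype failed:', err)

# The clean part converts without trouble
clean = raw.iloc[:2].astype('int64')
print(clean)
```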