[Ten Minutes with Pandas](https://github.com/reddyprasade/Pandas-with-Python/blob/master/1.%20Ten%20Minutes%20to%20Pandas.ipynb)

## 10 Minutes to pandas

This is a short introduction to pandas, geared mainly for new users.

Customarily, we import as follows:

In [2]:
import numpy as np
import pandas as pd

### Object Creation

Creating a Series by passing a list of values, letting pandas create a default integer index:

In [6]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s)

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
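
The float64 dtype above follows from the missing value: np.nan is a floating-point value, so a list that mixes integers with np.nan is stored as floats. A minimal sketch of this (the variable names are illustrative and not part of the original notebook):

```python
import numpy as np
import pandas as pd

# Without missing values, pandas infers an integer dtype.
ints = pd.Series([1, 3, 5])

# np.nan is a float, so including it upcasts the whole Series to float64.
with_nan = pd.Series([1, 3, 5, np.nan, 6, 8])

print(ints.dtype)
print(with_nan.dtype)         # float64
print(with_nan.isna().sum())  # 1 (one missing entry)
```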


Creating a DataFrame by passing a NumPy array, with a datetime index and labeled columns:

In [10]:
dates = pd.date_range('20130101', periods=6)
print(dates)

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')


In [12]:
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print(df)

                   A         B         C         D
2013-01-01 -0.169927  0.006401  0.952059 -0.704590
2013-01-02 -0.351168  0.193532  1.445028 -0.019720
2013-01-03 -0.837599  0.411161  0.999155  0.082159
2013-01-04  0.175181  0.097161 -1.961902 -0.422568
2013-01-05 -0.265802  0.863255  0.674600  0.623468
2013-01-06 -0.520628 -1.663890  0.225055  0.560485
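
Because np.random.randn draws new values on every run, the numbers printed above will differ on your machine. If reproducible output is needed, one option is a seeded NumPy generator. This is a sketch only: the helper name `make_frame` and the seed value are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def make_frame(seed):
    """Build a 6x4 frame of standard-normal values from a seeded generator."""
    rng = np.random.default_rng(seed)  # hypothetical seed, chosen by the caller
    dates = pd.date_range('20130101', periods=6)
    return pd.DataFrame(rng.standard_normal((6, 4)),
                        index=dates, columns=list('ABCD'))

first = make_frame(seed=0)
second = make_frame(seed=0)

# The same seed yields identical frames.
print(first.equals(second))  # True
```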


Creating a DataFrame by passing a dict of objects that can be converted to a series-like structure:

In [26]:
df2 = pd.DataFrame({ 'A' : 1.,
                        'B' : pd.Timestamp('20180102'),
                        'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                        'D' : np.array([3] * 4,dtype='int32'),
                        'E' : pd.Categorical(["test","train","test","train"]),
                        'F' : 'foo' })
print(df2)

     A          B    C  D      E    F
0  1.0 2018-01-02  1.0  3   test  foo
1  1.0 2018-01-02  1.0  3  train  foo
2  1.0 2018-01-02  1.0  3   test  foo
3  1.0 2018-01-02  1.0  3  train  foo


The columns of the resulting DataFrame have different dtypes.

In [27]:
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
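
Per-column dtypes can also be used to pick columns. The sketch below rebuilds a frame like df2 and keeps only its numeric columns with select_dtypes; the names `frame` and `numeric` are illustrative.

```python
import numpy as np
import pandas as pd

# Same construction as df2 above, under an illustrative name.
frame = pd.DataFrame({'A': 1.,
                      'B': pd.Timestamp('20180102'),
                      'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                      'D': np.array([3] * 4, dtype='int32'),
                      'E': pd.Categorical(["test", "train", "test", "train"]),
                      'F': 'foo'})

# Keep only columns with a numeric dtype (float64, float32, int32 here).
numeric = frame.select_dtypes(include='number')
print(list(numeric.columns))  # ['A', 'C', 'D']
```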

### Viewing Data

Here is how to view the top and bottom rows of the frame:

In [28]:
df.head()

                   A         B         C         D
2013-01-01 -0.169927  0.006401  0.952059 -0.704590
2013-01-02 -0.351168  0.193532  1.445028 -0.019720
2013-01-03 -0.837599  0.411161  0.999155  0.082159
2013-01-04  0.175181  0.097161 -1.961902 -0.422568
2013-01-05 -0.265802  0.863255  0.674600  0.623468


In [29]:
df.tail(3)

                   A         B         C         D
2013-01-04  0.175181  0.097161 -1.961902 -0.422568
2013-01-05 -0.265802  0.863255  0.674600  0.623468
2013-01-06 -0.520628 -1.663890  0.225055  0.560485


Display the index and columns:

In [31]:
df.index

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [33]:
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

DataFrame.to_numpy() gives a NumPy representation of the underlying data.

Note that this can be an expensive operation when your DataFrame has columns with different data types, which comes down to a fundamental difference between pandas and NumPy: NumPy arrays have **one dtype for the entire array**, while pandas DataFrames have **one dtype per column**.

When you call DataFrame.to_numpy(), pandas will find the NumPy dtype that can hold all of the dtypes in the DataFrame. This may end up being object, which requires casting every value to a Python object.

For df, our DataFrame of all floating-point values, DataFrame.to_numpy() is fast and doesn't require copying data.

In [35]:
df.to_numpy()

array([[-0.16992707,  0.0064013 ,  0.95205928, -0.70458983],
       [-0.35116771,  0.19353224,  1.44502808, -0.01972007],
       [-0.83759852,  0.41116114,  0.99915485,  0.08215888],
       [ 0.17518094,  0.0971606 , -1.9619022 , -0.42256843],
       [-0.26580194,  0.86325456,  0.67460017,  0.62346811],
       [-0.52062811, -1.6638902 ,  0.22505495,  0.56048546]])

For df2, the DataFrame with multiple dtypes, DataFrame.to_numpy() is relatively expensive.

In [37]:
df2.to_numpy()

array([[1.0, Timestamp('2018-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2018-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2018-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2018-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
      dtype=object)

**Note:** DataFrame.to_numpy() does not include the index or column labels in the output.
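
To see how the common dtype is chosen, the following sketch converts three small frames and inspects the dtype of each result. The frames and their column names are made up for illustration.

```python
import pandas as pd

# All columns are float64, so the array is float64 as well.
floats = pd.DataFrame({'x': [1.0, 2.0], 'y': [3.0, 4.0]})

# Integers and floats share a common numeric dtype: float64.
numbers = pd.DataFrame({'i': [1, 2], 'f': [0.5, 1.5]})

# Numbers and strings have no common numeric dtype, so pandas falls back to object.
mixed = pd.DataFrame({'x': [1.0, 2.0], 'label': ['a', 'b']})

print(floats.to_numpy().dtype)   # float64
print(numbers.to_numpy().dtype)  # float64
print(mixed.to_numpy().dtype)    # object
```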

describe() shows a quick statistical summary of your data:

In [39]:
df.describe()

              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean  -0.328324 -0.015397  0.388999  0.019872
std    0.340405  0.863517  1.219812  0.525859
min   -0.837599 -1.663890 -1.961902 -0.704590
25%   -0.478263  0.029091  0.337441 -0.321856
50%   -0.308485  0.145346  0.813330  0.031219
75%   -0.193896  0.356754  0.987381  0.440904
max    0.175181  0.863255  1.445028  0.623468
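
Each row of the summary corresponds to an ordinary reduction over the column. A small sketch with made-up data, comparing two rows of describe() with the matching DataFrame methods:

```python
import pandas as pd

# Illustrative data, not the random frame used above.
frame = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0],
                      'B': [10.0, 20.0, 30.0, 40.0]})

summary = frame.describe()

# The 'mean' row matches DataFrame.mean().
print(summary.loc['mean', 'A'], frame['A'].mean())    # 2.5 2.5

# The '50%' row is the median.
print(summary.loc['50%', 'A'], frame['A'].median())   # 2.5 2.5
```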


Transposing your data:

In [41]:
df.T

   2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
A   -0.169927   -0.351168   -0.837599    0.175181   -0.265802   -0.520628
B    0.006401    0.193532    0.411161    0.097161    0.863255   -1.663890
C    0.952059    1.445028    0.999155   -1.961902    0.674600    0.225055
D   -0.704590   -0.019720    0.082159   -0.422568    0.623468    0.560485


Sorting by an axis:

In [43]:
df.sort_index(axis=1, ascending=False)

                   D         C         B         A
2013-01-01 -0.704590  0.952059  0.006401 -0.169927
2013-01-02 -0.019720  1.445028  0.193532 -0.351168
2013-01-03  0.082159  0.999155  0.411161 -0.837599
2013-01-04 -0.422568 -1.961902  0.097161  0.175181
2013-01-05  0.623468  0.674600  0.863255 -0.265802
2013-01-06  0.560485  0.225055 -1.663890 -0.520628


Sorting by values:

In [45]:
df.sort_values(by='B')

                   A         B         C         D
2013-01-06 -0.520628 -1.663890  0.225055  0.560485
2013-01-01 -0.169927  0.006401  0.952059 -0.704590
2013-01-04  0.175181  0.097161 -1.961902 -0.422568
2013-01-02 -0.351168  0.193532  1.445028 -0.019720
2013-01-03 -0.837599  0.411161  0.999155  0.082159
2013-01-05 -0.265802  0.863255  0.674600  0.623468
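
sort_values also accepts a descending order and a list of columns. The sketch below uses a small made-up frame (the `team` and `points` columns are hypothetical) so the resulting order is easy to check by eye.

```python
import pandas as pd

scores = pd.DataFrame({'team': ['b', 'a', 'b', 'a'],
                       'points': [3, 7, 9, 1]})

# Largest values first.
by_points_desc = scores.sort_values(by='points', ascending=False)
print(by_points_desc['points'].tolist())       # [9, 7, 3, 1]

# Sort by team, then by points within each team.
by_team_then_points = scores.sort_values(by=['team', 'points'])
print(by_team_then_points['points'].tolist())  # [1, 7, 3, 9]
```

Both calls return a new DataFrame and leave `scores` unchanged.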


### Selection

**Note:** While standard Python / NumPy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code we recommend the optimized pandas data access methods: .at, .iat, .loc and .iloc.

See the indexing documentation: Indexing and Selecting Data, and MultiIndex / Advanced Indexing.
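
As a quick orientation, the four accessors named in the note differ along two lines: label versus integer position, and general selection versus single-scalar access. A minimal sketch on a small made-up frame (the name `frame` is illustrative):

```python
import pandas as pd

dates = pd.date_range('20130101', periods=3)
frame = pd.DataFrame({'A': [1.0, 2.0, 3.0],
                      'B': [4.0, 5.0, 6.0]}, index=dates)

by_label = frame.loc[dates[1], 'B']     # label-based selection
by_position = frame.iloc[1, 1]          # position-based selection
scalar_label = frame.at[dates[1], 'B']  # label-based, scalar only
scalar_position = frame.iat[1, 1]       # position-based, scalar only

# All four address the same cell.
print(by_label, by_position, scalar_label, scalar_position)  # 5.0 5.0 5.0 5.0
```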

### Getting

Selecting a single column, which yields a Series, equivalent to df.A:

In [48]:
df['A']

2013-01-01   -0.169927
2013-01-02   -0.351168
2013-01-03   -0.837599
2013-01-04    0.175181
2013-01-05   -0.265802
2013-01-06   -0.520628
Freq: D, Name: A, dtype: float64

Selecting via [], which slices the rows:

In [50]:
df[0:3]

                   A         B         C         D
2013-01-01 -0.169927  0.006401  0.952059 -0.704590
2013-01-02 -0.351168  0.193532  1.445028 -0.019720
2013-01-03 -0.837599  0.411161  0.999155  0.082159


In [52]:
df['20130102':'20130104']

                   A         B         C         D
2013-01-02 -0.351168  0.193532  1.445028 -0.019720
2013-01-03 -0.837599  0.411161  0.999155  0.082159
2013-01-04  0.175181  0.097161 -1.961902 -0.422568
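
The two slices above treat their endpoints differently: an integer slice excludes the stop position, as in plain Python, while a label slice includes the end label. A small sketch with an illustrative frame whose column A simply counts the rows:

```python
import pandas as pd

frame = pd.DataFrame({'A': range(6)},
                     index=pd.date_range('20130101', periods=6))

# Positions 1 and 2 only; position 3 is excluded.
positional = frame[1:3]
print(positional['A'].tolist())  # [1, 2]

# Labels 2013-01-02 through 2013-01-04, end label included.
labeled = frame['20130102':'20130104']
print(labeled['A'].tolist())     # [1, 2, 3]
```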


### Selection by Label

In [54]:
# For getting a cross section using a label:
df.loc[dates[0]]

A   -0.169927
B    0.006401
C    0.952059
D   -0.704590
Name: 2013-01-01 00:00:00, dtype: float64

In [56]:
# Selecting on a multi-axis by label:
df.loc[:, ['A', 'B']]

                   A         B
2013-01-01 -0.169927  0.006401
2013-01-02 -0.351168  0.193532
2013-01-03 -0.837599  0.411161
2013-01-04  0.175181  0.097161
2013-01-05 -0.265802  0.863255
2013-01-06 -0.520628 -1.663890


In [58]:
# Showing label slicing, both endpoints are included:
df.loc['20130102':'20130104', ['A', 'B']]

                   A         B
2013-01-02 -0.351168  0.193532
2013-01-03 -0.837599  0.411161
2013-01-04  0.175181  0.097161


In [61]:
# Reduction in the dimensions of the returned object:
df.loc['20130102', ['A', 'B']]

A   -0.351168
B    0.193532
Name: 2013-01-02 00:00:00, dtype: float64
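
The reduction in dimensions can be made explicit by checking the type of each result. This sketch uses a small made-up frame: a list of labels on an axis keeps that axis, while a single label drops it.

```python
import pandas as pd

dates = pd.date_range('20130101', periods=3)
frame = pd.DataFrame({'A': [1.0, 2.0, 3.0],
                      'B': [4.0, 5.0, 6.0]}, index=dates)

block = frame.loc[:, ['A', 'B']]      # slice and list: both axes kept
row = frame.loc[dates[0], ['A', 'B']] # single row label: one axis dropped
value = frame.loc[dates[0], 'A']      # single labels on both axes: a scalar

print(type(block).__name__)  # DataFrame
print(type(row).__name__)    # Series
print(value)                 # 1.0
```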

In [63]:
# For getting a scalar value:
df.loc[dates[0], 'A']

-0.16992706950573147

In [64]:
# For getting fast access to a scalar (equivalent to the prior method):
df.at[dates[0], 'A']

-0.16992706950573147