## Pandas 1.0 Features

Pandas 1.0 brings several new features that every data lover should know. This version has also __*dropped support for Python 2.7*__ and requires Python 3.6.1 or later (3.6.1+, 3.7, and 3.8). The release also removes a lot of functionality that was deprecated in previous releases.

Let's see what's new and how it can help us.

__Note:__ As per the documentation, features (2) and (3) are still considered experimental, and their behavior may change.

### (1) __Converting to Markdown using *to_markdown()*__ 

To use to_markdown(), you need the tabulate package installed.

You can install it with pip: __$ pip install tabulate__

In [8]:
import pandas as pd
df = pd.DataFrame({
                    'Name' : ['John', 'Jim', 'Jacob'],
                    'Age' : [10, 12, 11],
                    'Weight' : [29.5, 30, 32.8]
})

print('\n---------- using head() -----------\n')
print(df.head(),'\n')

print('\n---------- using to_markdown() -----------\n')
print(df.to_markdown())


---------- using head() -----------

    Name  Age  Weight
0   John   10    29.5
1    Jim   12    30.0
2  Jacob   11    32.8 


---------- using to_markdown() -----------

|    | Name   |   Age |   Weight |
|---:|:-------|------:|---------:|
|  0 | John   |    10 |     29.5 |
|  1 | Jim    |    12 |     30   |
|  2 | Jacob  |    11 |     32.8 |
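Because tabulate is an optional dependency, a script that calls to_markdown() may want to handle the case where it is missing. The following is a minimal sketch under that assumption; the small DataFrame and the variable name `md` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["John", "Jim"], "Age": [10, 12]})

try:
    # to_markdown() returns the table as a plain string
    md = df.to_markdown()
except ImportError:
    # pandas raises ImportError when tabulate is not installed
    md = None
    print("tabulate is not installed; run: pip install tabulate")
else:
    print(md)
```

Since the result is an ordinary string, it can be written to a README or a report file like any other text.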


### (2) __Dedicated String Datatype__

Previously, strings were stored in object-dtype NumPy arrays. The new release adds StringDtype, an extension type dedicated to string data. You can specify it as pd.StringDtype() or with the alias "string". The string accessor methods such as upper(), lower(), split(), and count() work on the string datatype, and accessor methods that return integers return them with Int64Dtype.

This new feature comes in handy
  - to keep a column text-only. An object-dtype column can hold anything, so text and other values (categorical labels, for example) used to share one datatype; a string column makes the distinction explicit.
  - to select only the text columns using select_dtypes()

In [2]:
s = pd.Series(['John', 'Jim', 'Jill'], dtype="string")
print('\n-------- Strings -------')
print(s)


-------- Strings -------
0    John
1     Jim
2    Jill
dtype: string
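To illustrate the point about accessor methods, here is a short sketch comparing the string dtype with object dtype when a value is missing. The sample names and variable names are illustrative only.

```python
import pandas as pd

names = pd.Series(["John", "Jim", None], dtype="string")

# Text-returning accessors keep the string dtype
upper = names.str.upper()

# Integer-returning accessors use the nullable Int64 dtype,
# so the missing entry stays missing instead of forcing floats
lengths = names.str.len()
print(upper.dtype, lengths.dtype)

# With object dtype, the same call falls back to float64
obj_lengths = pd.Series(["John", "Jim", None], dtype="object").str.len()
print(obj_lengths.dtype)
```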


In [3]:
df = pd.read_csv('sample.csv')

print(df.head())

print('\n')
print(df.dtypes)

# Convert the Tweet column from object to the dedicated string dtype
df['Tweet'] = df['Tweet'].astype("string")

print(df.select_dtypes("string"))

   Label                                              Tweet
0      0             could go for some ASU right about now 
1      0  Could have been home right now if I would have...
2      0                         could have been there now 
3      0  could have done with more sleep, i think im co...
4      0  Could have eaten another pizza. Did the right ...


Label     int64
Tweet    object
dtype: object
                                                Tweet
0              could go for some ASU right about now 
1   Could have been home right now if I would have...
2                          could have been there now 
3   could have done with more sleep, i think im co...
4   Could have eaten another pizza. Did the right ...
5   could have used a knight in shining armor last...
6   Could i be any happier. Hmmm well maybe just o...
7   Could i just quickly establish that i love @am...
8   Could imagine if starbucks was actually puttin...
9   Could it be? I think I'm finally getting the T..

### (3) __NA scalar to denote missing values__
    
A singleton value, __pd.NA__, is introduced to represent scalar missing values.
Earlier, pandas used
 - np.nan for float data
 - np.nan or None for object-dtype data
 - pd.NaT for datetime-like data.

__pd.NA__ aims to provide a "missing" indicator that can be used consistently across data types. It is currently used by the nullable integer, boolean, and new string datatypes.

In [4]:
s = pd.Series([1, 2, None], dtype="Int64")
print('\n-------- Integer -------')
print(s)

s = pd.Series(["Jim", None, "Jack"], dtype="string")
print('\n-------- String -------')
print(s)


-------- Integer -------
0       1
1       2
2    <NA>
dtype: Int64

-------- String -------
0     Jim
1    <NA>
2    Jack
dtype: string
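It may help to see how pd.NA behaves in expressions. The sketch below assumes the propagation rules described in the pandas documentation (which marks this behavior as experimental); the `ages` Series and the other variable names are made up for illustration.

```python
import pandas as pd

# Arithmetic and comparisons with pd.NA propagate the missing value
print(pd.NA + 1)
print(pd.NA == 1)

# Use pd.isna() to test for it, since == cannot give True or False
print(pd.isna(pd.NA))

ages = pd.Series([10, None, 14], dtype="Int64")
shifted = ages + 1              # the missing entry stays <NA>
n_missing = ages.isna().sum()   # count the missing entries
print(shifted)
print(n_missing)
```

Because a comparison with pd.NA is itself pd.NA, checks such as `value == pd.NA` are not a reliable way to detect missing data; pd.isna() is the safer choice.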


### (4) __Boolean data type with missing values support__

The default bool datatype can hold only True or False, not missing values; as the example below shows, None is silently converted to False. The new BooleanArray can store missing values as well. You can specify this datatype as pd.BooleanDtype() or with the alias "boolean".


In [5]:
s = pd.Series([True, None, False, None, False], dtype="bool")
print('\n-------- bool -------')
print(s)

s = pd.Series([True, None, False, None, False], dtype="boolean")
print('\n-------- Boolean -------')
print(s)


-------- bool -------
0     True
1    False
2    False
3    False
4    False
dtype: bool

-------- Boolean -------
0     True
1     <NA>
2    False
3     <NA>
4    False
dtype: boolean
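Logical operations on the nullable boolean dtype treat a missing value as "unknown": the result is missing only when it actually depends on the unknown value. A small sketch, with an illustrative `flags` Series:

```python
import pandas as pd

flags = pd.Series([True, None, False], dtype="boolean")

# OR with True is True regardless of the unknown value
or_true = flags | True

# AND with False is False regardless of the unknown value
and_false = flags & False

# AND with True depends on the unknown value, so it stays <NA>
and_true = flags & True

print(or_true)
print(and_false)
print(and_true)
```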


### (5) __convert_dtypes method to ease use of supported extension dtypes__ 

The new methods DataFrame.convert_dtypes() and Series.convert_dtypes() have been added to encourage the use of extension datatypes that support pd.NA.


In [6]:
import numpy as np
df = pd.DataFrame({
                    'Name' : ['John', 'Jill', 'Jacob'],
                    'Age' : [10, 12, 14],
                    'is_tall' : [True, True, False]
})

print('------ datatypes ------')
print(df.dtypes)

conv = df.convert_dtypes()

print('\n------ datatypes converted ------')
print(conv.dtypes)


------ datatypes ------
Name       object
Age         int64
is_tall      bool
dtype: object

------ datatypes converted ------
Name        string
Age          Int64
is_tall    boolean
dtype: object
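The benefit is easier to see when the data contains missing values. In this sketch (the data is invented for illustration), each column has a gap, and convert_dtypes() moves each column to a nullable extension dtype so the values keep their natural type.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", None, "Jacob"],
    "Age": [10, None, 14],
    "is_tall": [True, False, None],
})

# Before conversion the gaps force wider dtypes (Age becomes float, for example)
print(df.dtypes)

conv = df.convert_dtypes()

# After conversion each column uses a dtype that supports pd.NA
print(conv.dtypes)
print(conv)
```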


### (6) __Improved DataFrame.info()__

DataFrame.info() now prints a summary of the DataFrame that is simpler to read, which makes data exploration easier.
      

In [7]:
df = pd.read_csv('sample.csv')

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16 entries, 0 to 15
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Label   16 non-null     int64 
 1   Tweet   16 non-null     object
dtypes: int64(1), object(1)
memory usage: 384.0+ bytes
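info() prints to standard output by default, but it also accepts a `buf` argument, which is handy when the summary should go into a log or a report. A minimal sketch, using a small made-up DataFrame in place of sample.csv:

```python
import io

import pandas as pd

df = pd.DataFrame({"Label": [0, 1, 0], "Tweet": ["a", None, "c"]})

# Write the summary into an in-memory text buffer instead of stdout
buffer = io.StringIO()
df.info(buf=buffer)

summary = buffer.getvalue()
print(summary)
```

The "Non-Null Count" column in the summary makes it easy to spot which columns contain missing values.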
