# String/Text methods
### .str.method()

In [1]:
import numpy as np
import pandas as pd

**Two ways to store text data in pandas:**
1. object-dtype NumPy array  
- can store not only strings but also mixed data, including null values
- default datatype for strings

2. StringDtype  
- must be requested explicitly, e.g. with `dtype="string"`
- newer, dedicated datatype for strings
- can reduce memory consumption, depending on the storage backend
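
The difference between the two storage options can be sketched as follows. This is a minimal illustration, not part of the original notebook: the sample values are made up, and the dtype pandas infers by default may vary with the pandas version.

```python
import pandas as pd

# Default inference: older pandas versions store these strings as object dtype.
obj_names = pd.Series(['ritika', 'shruti', None])

# Opting in explicitly: "string" is an alias for pd.StringDtype().
str_names = pd.Series(['ritika', 'shruti', None], dtype="string")

print(obj_names.dtype)
print(str_names.dtype)

# In the string dtype, missing values are represented as pd.NA.
print(str_names.isna().tolist())

# The .str accessor works the same way on both.
print(str_names.str.upper())
```

An existing object-dtype Series can also be converted with `obj_names.astype("string")`.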

In [2]:
#splitting email to find name

email = 'ritika@email.com'

In [3]:
email.split('@')

['ritika', 'email.com']

In [4]:
email.split('@')[0]

'ritika'

In [23]:
#String methods on a pandas series

names = pd.Series(['ritika','shruti','june','bombay','5'])
names

0    ritika
1    shruti
2      june
3    bombay
4         5
dtype: object

In [24]:
#capitalize the first letter of each string (the rest is lowercased)
names.str.capitalize()

0    Ritika
1    Shruti
2      June
3    Bombay
4         5
dtype: object
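
For multi-word values, `capitalize()` and `title()` behave differently. A small sketch, using a made-up `places` Series that is not in the original notebook:

```python
import pandas as pd

# Hypothetical multi-word data for illustration.
places = pd.Series(['new york', 'san FRANCISCO'])

# capitalize(): upper-cases only the first character, lower-cases the rest.
first_only = places.str.capitalize()

# title(): upper-cases the first character of every word.
every_word = places.str.title()

print(first_only.tolist())   # ['New york', 'San francisco']
print(every_word.tolist())   # ['New York', 'San Francisco']
```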

In [14]:
#uppercase
names.str.upper()

0    RITIKA
1    SHRUTI
2      JUNE
3    BOMBAY
4         5
dtype: object

In [16]:
#counting the number of times a character occurs

names.str.count('r')

0    1
1    1
2    0
3    0
4    0
dtype: int64

In [18]:
#checking whether each value consists only of digits
names.str.isdigit()

0    False
1    False
2    False
3    False
4     True
dtype: bool

In [19]:
#selecting only the rows where the value consists only of digits
names[names.str.isdigit()]

4    5
dtype: object

### Splitting, grabbing and expanding series into a dataframe

In [25]:
tech_finance = ['GOOG,APPL,AMZN','JPM,BAC,GS']

In [26]:
tickers = pd.Series(tech_finance)
tickers

0    GOOG,APPL,AMZN
1        JPM,BAC,GS
dtype: object

In [29]:
# SPLITTING

tickers.str.split(',')

0    [GOOG, APPL, AMZN]
1        [JPM, BAC, GS]
dtype: object

In [30]:
# GRABBING the first row from the split Series
tickers.str.split(',')[0]

['GOOG', 'APPL', 'AMZN']

In [31]:
# grabbing the first item from each row of the split Series
tickers.str.split(',').str[0]

0    GOOG
1     JPM
dtype: object

In [34]:
# expanding the split data into a dataframe by making separate columns
df = tickers.str.split(',',expand=True)
df

      0     1     2
0  GOOG  APPL  AMZN
1   JPM   BAC    GS


In [35]:
# checking if the table above is a pandas dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       2 non-null      object
 1   1       2 non-null      object
 2   2       2 non-null      object
dtypes: object(3)
memory usage: 176.0+ bytes
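
The same split, grab and expand steps apply to the email example from the start of the notebook once the addresses are in a Series. A minimal sketch; the second address and the column names `user` and `domain` are made up for illustration:

```python
import pandas as pd

# Hypothetical email data; only the first address appears in the notebook.
emails = pd.Series(['ritika@email.com', 'shruti@example.org'])

# Split on '@' and grab the first piece of each row.
usernames = emails.str.split('@').str[0]
print(usernames.tolist())   # ['ritika', 'shruti']

# Expand into a dataframe and give the columns readable names.
parts = emails.str.split('@', expand=True)
parts.columns = ['user', 'domain']
print(parts)
```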


### Basic text cleaning by stacking str methods

In [98]:
messy_names = pd.Series(['andrew  ' , 'rit;ika' , '  claire  '])
messy_names

0      andrew  
1       rit;ika
2      claire  
dtype: object

In [99]:
#replacing semicolon
messy_names_new = messy_names.str.replace(';','')
messy_names_new

0      andrew  
1        ritika
2      claire  
dtype: object

In [100]:
#stripping the whitespace
messy_names_new.str.strip()

0    andrew
1    ritika
2    claire
dtype: object

In [101]:
#above output in a single line

messy_names.str.replace(';','').str.strip()

0    andrew
1    ritika
2    claire
dtype: object

In [105]:
# CLEANING AND CAPITALIZING IN A SINGLE COMMAND

messy_names.str.replace(';','').str.strip().str.capitalize()

0    Andrew
1    Ritika
2    Claire
dtype: object

In [106]:
# USING THE apply() METHOD INSTEAD OF THE LENGTHY CHAIN ABOVE
# CLEANING AND CAPITALIZING DATA USING apply()

# NOTE: we do not need .str inside the function because Series.apply()
# calls it once per element, so each `name` is a plain Python string

def cleanup(name):
    name = name.replace(';','')
    name = name.strip()
    name = name.capitalize()
    return name

messy_names.apply(cleanup)

0    Andrew
1    Ritika
2    Claire
dtype: object

In [108]:
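# NOTE (rule of thumb for this kind of element-wise cleanup; measure on your own data):
# np.vectorize() is usually the fastest
# apply() is usually the next fastest
# chained .str methods are usually the slowest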
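
The ranking above can be checked with `timeit`. The following is a rough sketch rather than a benchmark: the repetition factor of 1000 and `number=20` are arbitrary choices, and the actual ordering may differ with data size and pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def cleanup(name):
    # Same cleanup steps as above, applied to one plain Python string.
    name = name.replace(';', '')
    name = name.strip()
    name = name.capitalize()
    return name


# Repeat the sample data so the timings are less dominated by overhead.
messy_names = pd.Series(['andrew  ', 'rit;ika', '  claire  '] * 1000)


def with_str_methods():
    return messy_names.str.replace(';', '', regex=False).str.strip().str.capitalize()


def with_apply():
    return messy_names.apply(cleanup)


# np.vectorize wraps the function so it can be called on a whole array;
# it returns a NumPy array rather than a Series.
vectorized_cleanup = np.vectorize(cleanup)


def with_vectorize():
    return vectorized_cleanup(messy_names)


for fn in (with_str_methods, with_apply, with_vectorize):
    seconds = timeit.timeit(fn, number=20)
    print(f'{fn.__name__}: {seconds:.4f} s')
```

All three approaches produce the same cleaned values; only the container type and the speed differ.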