## Introducing pandas data structures: Series, DataFrame, and Index objects

Pandas is a library built on NumPy for data manipulation. Unlike NumPy arrays, which are indexed only by integer position, pandas objects can also be indexed by descriptive labels. Series, DataFrame, and Index are the basic data structures in the library. A Series is a one-dimensional labeled array whose elements share a single data type (a Series holding strings or mixed values is stored with the generic object dtype). It is similar to a NumPy array, but it can be indexed with labels you specify as well as with integer positions.

In [3]:
# convention for importing pandas
import pandas as pd
import numpy as np
days = pd.Series(['Monday', 'Tuesday', 'Wednesday'])
print(days)

0       Monday
1      Tuesday
2    Wednesday
dtype: object


In [5]:
# creating a Series from a NumPy array
days_list = np.array(['Monday', 'Tuesday', 'Wednesday'])
numpy_days = pd.Series(days_list)
print(numpy_days)

0       Monday
1      Tuesday
2    Wednesday
dtype: object


In [9]:
# using strings as index
days = pd.Series(['Monday', 'Tuesday', 'Wednesday'],
                index = ['a', 'b', 'c'])
print(days)

# create series from a dictionary
days1 = pd.Series({'a':'Monday', 'b':'Tuesday', 'c':'Wednesday'})
print(days1)

a       Monday
b      Tuesday
c    Wednesday
dtype: object
a       Monday
b      Tuesday
c    Wednesday
dtype: object


In [14]:
print(days.iloc[0])   # first element by position
print(days.iloc[1:])  # from position 1 onward
print(days['c'])      # element with label 'c'

Monday
b      Tuesday
c    Wednesday
dtype: object
Wednesday
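
When a Series carries integer labels that differ from the positions, plain square brackets can be ambiguous. The following is a minimal sketch, using a hypothetical temps Series that is not part of the original notebook, of how loc (labels) and iloc (positions) make the intent explicit:

```python
import pandas as pd

# Hypothetical example data: integer labels that do not match positions
temps = pd.Series([21.5, 19.0, 23.2], index=[10, 20, 30])

by_label = temps.loc[10]       # label 10 is the first element
by_position = temps.iloc[0]    # position 0 is also the first element

# Label slices include the end label; position slices exclude the end position
label_slice = temps.loc[10:20]      # labels 10 and 20
position_slice = temps.iloc[0:1]    # position 0 only

print(by_label, by_position)
print(label_slice)
print(position_slice)
```

Being explicit about labels versus positions avoids surprises when the index is later changed or re-sorted.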


A DataFrame is a two-dimensional table whose columns are Series that share the same index. It holds data in rows and columns, like a spreadsheet. A DataFrame can be created from Series, dictionaries, lists, NumPy arrays, or other DataFrames.

In [27]:
print(pd.DataFrame()) #prints an empty dataframe

#create a dataframe from a dictionary
df_dict = {'Country':['Ghana', 'Kenya', 'Nigeria', 'Togo'],
          'Capital':['Accra', 'Nairobi', 'Abuja', 'Lome'],
          'Population':[10000, 8500, 35000, 12000],
          'Age':[60, 70, 80, 75]}

df = pd.DataFrame(df_dict,index = [2, 4, 6, 8])
print(df)

Empty DataFrame
Columns: []
Index: []
   Country  Capital  Population  Age
2    Ghana    Accra       10000   60
4    Kenya  Nairobi        8500   70
6  Nigeria    Abuja       35000   80
8     Togo     Lome       12000   75


In [24]:
df_list = [['Ghana', 'Accra', 10000, 60],
          ['Kenya', 'Nairobi', 8500, 70],
          ['Nigeria', 'Abuja', 35000, 80],
          ['Togo', 'Lome', 12000, 75]]
df1 = pd.DataFrame(df_list, columns=['Country', 'Capital', 'Population', 'Age'], index=[2, 4, 6, 8])
print(df1)

   Country  Capital  Population  Age
2    Ghana    Accra       10000   60
4    Kenya  Nairobi        8500   70
6  Nigeria    Abuja       35000   80
8     Togo     Lome       12000   75
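
The cells above build a DataFrame from a dictionary and from a list of lists. As a rough sketch of two of the other input types mentioned earlier, Series and NumPy arrays, the following uses illustrative variable names (capital, age, values) that are assumptions, not part of the original notebook:

```python
import numpy as np
import pandas as pd

# Two Series that share the same index become two columns
capital = pd.Series(['Accra', 'Nairobi'], index=['Ghana', 'Kenya'])
age = pd.Series([60, 70], index=['Ghana', 'Kenya'])
from_series = pd.DataFrame({'Capital': capital, 'Age': age})

# A 2-D NumPy array has no column names, so they are supplied explicitly
values = np.array([[1, 2], [3, 4]])
from_array = pd.DataFrame(values, columns=['x', 'y'])

print(from_series)
print(from_array)
```

When no index is given, as with from_array, pandas assigns the default integer index 0, 1, 2, and so on.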


at, iat, loc, and iloc are accessors used to retrieve data from a DataFrame. iloc selects rows and columns by integer position, while loc selects them by label. at and iat retrieve a single value: at takes a row label and a column label, and iat takes a row position and a column position.

In [35]:
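# select the row at position 3 (the fourth row)
print(df.iloc[3])

# select the row with index label 6
print(df.loc[6])

# select the Capital column
print(df['Capital'])

# single value at row label 6, column 'Country'
print(df.at[6, 'Country'])

# single value at row position 1, column position 0
print(df.iat[1, 0])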

Country        Togo
Capital        Lome
Population    12000
Age              75
Name: 8, dtype: object
Country       Nigeria
Capital         Abuja
Population      35000
Age                80
Name: 6, dtype: object
2      Accra
4    Nairobi
6      Abuja
8       Lome
Name: Capital, dtype: object
Nigeria
Kenya
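
The accessors can also select rows and columns in one step. The sketch below rebuilds a smaller version of the table (the Population column is left out for brevity, which is a simplification of the original data) and shows a label slice, the equivalent position slice, and a boolean filter:

```python
import pandas as pd

df = pd.DataFrame(
    {'Country': ['Ghana', 'Kenya', 'Nigeria', 'Togo'],
     'Capital': ['Accra', 'Nairobi', 'Abuja', 'Lome'],
     'Age': [60, 70, 80, 75]},
    index=[2, 4, 6, 8])

# loc: row labels 4 through 6 (end label included), two named columns
by_label = df.loc[4:6, ['Country', 'Capital']]

# iloc: row positions 1 and 2 (end position excluded), first two columns
by_position = df.iloc[1:3, 0:2]

# loc also accepts a boolean mask to filter rows
older = df.loc[df['Age'] > 70, 'Country']

print(by_label)
print(by_position)
print(older)
```

Here by_label and by_position hold the same two rows, which shows that a label slice includes its end point while a position slice does not.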

Finally, an Index in pandas is an immutable array of labels. The labels are usually unique, but duplicates are allowed. An Index behaves much like an ordered set: it is used to look up rows and columns in a DataFrame and to align data when several Series or DataFrames are combined.
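
A short sketch of these Index properties follows. The names idx_a, idx_b, dup, s1, and s2 are illustrative and do not come from the original notebook:

```python
import pandas as pd

idx_a = pd.Index([2, 4, 6, 8])
idx_b = pd.Index([6, 8, 10])

# Index objects are immutable: item assignment raises TypeError
try:
    idx_a[0] = 99
    mutated = True
except TypeError:
    mutated = False

# Set-like operations
common = idx_a.intersection(idx_b)   # labels present in both
combined = idx_a.union(idx_b)        # labels present in either

# Duplicate labels are permitted; is_unique reports whether all labels differ
dup = pd.Index(['a', 'b', 'a'])

# Arithmetic aligns on labels; a label missing on one side yields NaN
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])
aligned_sum = s1 + s2

print(mutated, list(common), list(combined), dup.is_unique)
print(aligned_sum)
```

The alignment step is what lets pandas combine objects whose rows are in different orders or only partly overlap.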

1. Important pandas functionality: indexing, reindexing, selection, grouping, dropping entries, ranking, sorting, handling duplicates, and hierarchical indexing.
2. Summary and descriptive statistics: measures of central tendency, measures of dispersion, skewness and kurtosis, correlation and multicollinearity.
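
Several of the operations named in item 1 can be sketched in a few lines. The sales table below is hypothetical data invented for illustration, with one repeated row so that duplicate handling has something to act on:

```python
import pandas as pd

# Hypothetical data; row 3 repeats row 1 on purpose
sales = pd.DataFrame({
    'Region': ['West', 'East', 'West', 'East', 'East'],
    'Country': ['Ghana', 'Kenya', 'Nigeria', 'Kenya', 'Uganda'],
    'Units': [10, 25, 40, 25, 15],
})

deduped = sales.drop_duplicates()                         # drop the repeated Kenya row
by_units = deduped.sort_values('Units', ascending=False)  # sort rows by a column
ranks = deduped['Units'].rank(ascending=False)            # rank 1.0 is the largest value
no_region = deduped.drop(columns='Region')                # drop a column
per_region = deduped.groupby('Region')['Units'].sum()     # group, then aggregate
reindexed = deduped.reindex([0, 1, 2, 9])                 # unknown label 9 becomes a NaN row

# Hierarchical (multi-level) index built from two columns
hier = deduped.set_index(['Region', 'Country'])
ghana_units = hier.loc[('West', 'Ghana'), 'Units']

print(by_units)
print(per_region)
print(ghana_units)
```

Each call returns a new object and leaves sales unchanged, so the steps can be chained or inspected one at a time.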

In [38]:
df['Population'].sum()
df.mean(numeric_only=True)  # average the numeric columns only
df.describe()



         Population        Age
count      4.000000   4.000000
mean   16375.000000  71.250000
std    12499.166639   8.539126
min     8500.000000  60.000000
25%     9625.000000  67.500000
50%    11000.000000  72.500000
75%    17750.000000  76.250000
max    35000.000000  80.000000
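
describe() covers the mean, spread, and quartiles. The other statistics listed in item 2 have their own methods. A minimal sketch, reusing only the two numeric columns from the table above:

```python
import pandas as pd

df = pd.DataFrame({'Population': [10000, 8500, 35000, 12000],
                   'Age': [60, 70, 80, 75]})

median = df.median()     # central tendency
variance = df.var()      # dispersion (sample variance, ddof=1)
skewness = df.skew()     # asymmetry of each column
kurt = df.kurt()         # excess kurtosis of each column
corr = df.corr()         # pairwise Pearson correlation matrix

print(median)
print(variance)
print(skewness)
print(kurt)
print(corr)
```

With only four rows these figures are illustrative rather than meaningful. A correlation matrix like corr is a common first check for multicollinearity between predictors.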


3. The missing data enigma: importance, types, and handling of missing data.

Data used for analysis in real-life scenarios is often incomplete because of omissions, faulty devices, and many other factors. Pandas represents missing values as NA or NaN. They can be detected with isnull() and notnull(), filled with fillna(), removed with dropna(), and substituted with replace().

In [43]:
df_dict2 = {'Name':['James', 'Yemen', 'Caro', np.nan],
           'Profession':['Researcher', 'Artist', 'Doctor', 'Writer'],
           'Experience':[12, np.nan, 10, 8],
           'Height':[np.nan, 175, 180, 150]}

new_df = pd.DataFrame(df_dict2)
print(new_df)

    Name  Profession  Experience  Height
0  James  Researcher        12.0     NaN
1  Yemen      Artist         NaN   175.0
2   Caro      Doctor        10.0   180.0
3    NaN      Writer         8.0   150.0


In [45]:
# check for cells with missing values as True
new_df.isnull()

    Name  Profession  Experience  Height
0  False       False       False    True
1  False       False        True   False
2  False       False       False   False
3   True       False       False   False


In [46]:
# remove rows with missing values
new_df.dropna()

   Name  Profession  Experience  Height
2  Caro      Doctor        10.0   180.0
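
Dropping every incomplete row leaves a single record here, so filling or dropping selectively is often preferable. The sketch below rebuilds the table without the Profession column for brevity. The choices of fill values (column means, the placeholder 'Unknown') and the idea of treating 150 as a sentinel are assumptions made purely for illustration:

```python
import numpy as np
import pandas as pd

new_df = pd.DataFrame({'Name': ['James', 'Yemen', 'Caro', np.nan],
                       'Experience': [12, np.nan, 10, 8],
                       'Height': [np.nan, 175, 180, 150]})

missing_per_column = new_df.isnull().sum()   # count of missing values per column
has_name = new_df['Name'].notnull()          # True where a name is present

# Fill numeric gaps with the column mean and missing names with a placeholder
filled = new_df.fillna({'Experience': new_df['Experience'].mean(),
                        'Height': new_df['Height'].mean(),
                        'Name': 'Unknown'})

# Drop only the rows where a specific column is missing
named_only = new_df.dropna(subset=['Name'])

# replace() swaps chosen values; here a hypothetical sentinel of 150 becomes NaN
no_sentinel = new_df.replace({'Height': {150: np.nan}})

print(missing_per_column)
print(filled)
print(named_only)
```

Which strategy is appropriate depends on why the values are missing; mean imputation in particular can distort the distribution of a column.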
