# Creating a DataFrame Using the Pandas Library

Pandas is an open-source Python library used mostly for data manipulation and analysis, often as a preparatory step in machine learning work. It can be installed by running the following pip command in a command prompt or terminal:

**pip install pandas**

You can also install it from within a Jupyter notebook by running the command in a cell, usually prefixed with `!` or `%` (for example, `%pip install pandas`).
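
Once the installation finishes, a quick way to confirm that pandas can be imported is to print its version. This is a minimal check added for illustration, not part of the original notebook:

```python
import pandas as pd

# Print the installed pandas version to confirm the import works.
print(pd.__version__)
```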



<h3>Using a List of Lists</h3>

We can create a DataFrame from a list of lists, as in the code below, where three inner lists sit inside one outer list. This structure is usually called a **nested list**. Each inner list becomes one row of the DataFrame.


In [1]:
# Import pandas library 
import pandas as pd 
  
# Define the rows, the column names, and the index labels
data = [['Jio', 2017], ['Shell', 2000], ['Bharat Petroleum', 14]]
column_names = ['Company_Name', 'Year_Established']
index = [1, 2, 3]

# Create the pandas DataFrame, passing the column names and index labels
df = pd.DataFrame(data, columns=column_names, index=index)
  
# print dataframe. 
print(df) 

       Company_Name  Year_Established
1               Jio              2017
2             Shell              2000
3  Bharat Petroleum                14


<h3>Using Dictionary</h3>

We can also create a DataFrame from a dictionary, as shown below. Each key becomes a column name and each value is a list holding that column's entries, here strings and integers. The lists can hold any data type, as long as they all have the same length.

In [3]:
import pandas as pd 
  
# Initialise a dictionary of lists.
data = {'Car_Model':['passat', 'i20', 'i10', 'Audi_R8'], 'Price':[11, 8, 7, 180]} 
  
# Create DataFrame 
df = pd.DataFrame(data) 
  
# Print the output. 
df 

Unnamed: 0,Car_Model,Price
0,passat,11
1,i20,8
2,i10,7
3,Audi_R8,180
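
A related pattern, sketched here as an illustration rather than taken from the original notebook, is to pass a list of dictionaries with one dictionary per row. The names `rows` and `df_rows` are hypothetical; the values are the same car data as above.

```python
import pandas as pd

# Each dictionary describes one row; its keys become the column names.
rows = [
    {'Car_Model': 'passat', 'Price': 11},
    {'Car_Model': 'i20', 'Price': 8},
    {'Car_Model': 'i10', 'Price': 7},
    {'Car_Model': 'Audi_R8', 'Price': 180},
]

df_rows = pd.DataFrame(rows)
print(df_rows)
```

This form is convenient when the records arrive one at a time, for example from a JSON file, whereas the dictionary-of-lists form suits data that is already organised by column.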


<h3>Creating a DataFrame Using the NumPy Library</h3>

Pandas and NumPy usually go hand in hand. The code below produces a 5×5 array of random integers and converts it into a DataFrame. The column names are passed as a list: `list('abcde')` expands to `a`, `b`, `c`, `d`, `e`, which appear as the column headers in the output.


In [4]:
import numpy as np
df_a = pd.DataFrame(np.random.randint(0, 3, (5, 5)), columns=list('abcde'))
df_a

Unnamed: 0,a,b,c,d,e
0,1,0,1,2,0
1,1,1,2,0,0
2,1,0,0,1,0
3,0,0,1,0,0
4,0,1,0,0,2
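
Because `np.random.randint` draws new values on every run, the table above will differ from one run to the next. If reproducible output is wanted, one option is to draw from a seeded generator. This is a sketch under that assumption, not something the original notebook does; the seed `0` and the names `rng` and `df_seeded` are arbitrary choices.

```python
import numpy as np
import pandas as pd

# A generator created with a fixed seed returns the same values on every run.
rng = np.random.default_rng(0)

# integers(low, high, size) draws from the half-open range [low, high).
df_seeded = pd.DataFrame(rng.integers(0, 3, size=(5, 5)), columns=list('abcde'))
print(df_seeded)
```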


**Another Method**

We can also generate the column names programmatically when they should be consecutive values. Here a generator expression over `range(1, 6)` supplies the names 1 to 5.

In [5]:
x=np.random.randint(0,2,(4,5))
df = pd.DataFrame(x, columns = (i for i in range (1,6))  )
df

Unnamed: 0,1,2,3,4,5
0,1,0,1,0,1
1,0,0,1,0,0
2,0,0,0,1,1
3,0,0,0,1,1
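
When descriptive names are preferred over bare integers, the same idea works with a list comprehension. The `col_` prefix is a hypothetical naming scheme chosen for this sketch, and the array of zeros is only a placeholder for real data.

```python
import numpy as np
import pandas as pd

# Placeholder values; only the generated column names matter here.
x = np.zeros((4, 5), dtype=int)

# Build the names col_1 ... col_5 with a list comprehension.
column_names = [f'col_{i}' for i in range(1, 6)]

df_named = pd.DataFrame(x, columns=column_names)
print(df_named.columns.tolist())
```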


# Deep Dive into Pandas for Feature Selection

In [6]:
import pandas as pd
import numpy as np

data = {'birds': ['Cranes', 'Cranes', 'plovers', 'spoonbills', 'spoonbills', 'Cranes', 'plovers', 'Cranes', 'spoonbills', 'spoonbills'], 
        'age': [3.5, 4, 1.5, np.nan, 6, 3, 5.5, np.nan, 8, 4], 
        'visits': [2, 4, 3, 4, 3, 4, 2, 2, 3, 2], 
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

**1. Create a DataFrame from the dictionary `data`, using `labels` as the index.**

In [7]:
bird = pd.DataFrame(data, index =labels)
bird

Unnamed: 0,birds,age,visits,priority
a,Cranes,3.5,2,yes
b,Cranes,4.0,4,yes
c,plovers,1.5,3,no
d,spoonbills,,4,yes
e,spoonbills,6.0,3,no
f,Cranes,3.0,4,no
g,plovers,5.5,2,no
h,Cranes,,2,yes
i,spoonbills,8.0,3,no
j,spoonbills,4.0,2,no


**3. Checking for the null values and printing the respective rows**



In [8]:
null = bird[bird.isna().any(axis =1)]
print(null)

        birds  age  visits priority
d  spoonbills  NaN       4      yes
h      Cranes  NaN       2      yes


**4. Checking for null values in a specific column and printing the matching rows.**


In [9]:
null_1 = bird[bird['age'].isna()]
print(null_1)

        birds  age  visits priority
d  spoonbills  NaN       4      yes
h      Cranes  NaN       2      yes
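
Before filtering rows, it can help to see how many values are missing in each column. The sketch below rebuilds the same DataFrame so that it runs on its own; `missing_per_column` is a hypothetical name.

```python
import numpy as np
import pandas as pd

data = {
    'birds': ['Cranes', 'Cranes', 'plovers', 'spoonbills', 'spoonbills',
              'Cranes', 'plovers', 'Cranes', 'spoonbills', 'spoonbills'],
    'age': [3.5, 4, 1.5, np.nan, 6, 3, 5.5, np.nan, 8, 4],
    'visits': [2, 4, 3, 4, 3, 4, 2, 2, 3, 2],
    'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no'],
}
bird = pd.DataFrame(data, index=list('abcdefghij'))

# isna() marks each missing cell as True; sum() counts the True values per column.
missing_per_column = bird.isna().sum()
print(missing_per_column)
```

Only the `age` column contains missing values here, which is why filtering on any column and filtering on `age` alone return the same two rows.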


**5. Print the first 2 rows of the birds dataframe**

In [10]:
bird[:2]

Unnamed: 0,birds,age,visits,priority
a,Cranes,3.5,2,yes
b,Cranes,4.0,4,yes


**6. Print all the rows with only 'birds' and 'age' columns from the dataframe**

In [11]:
bird[['birds','age']]

Unnamed: 0,birds,age
a,Cranes,3.5
b,Cranes,4.0
c,plovers,1.5
d,spoonbills,
e,spoonbills,6.0
f,Cranes,3.0
g,plovers,5.5
h,Cranes,
i,spoonbills,8.0
j,spoonbills,4.0


**7. Select rows [2, 3, 7] (counting from 1, i.e. labels 'b', 'c', 'g') and columns ['birds', 'age', 'visits']**

In [12]:
bird.loc[['b','c','g'],['birds','age','visits']]

Unnamed: 0,birds,age,visits
b,Cranes,4.0,4
c,plovers,1.5,3
g,plovers,5.5,2
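
The same selection can be made by integer position with `iloc`, which counts from 0, so rows 2, 3 and 7 become positions 1, 2 and 6. The sketch below rebuilds the DataFrame to be self-contained; `by_label` and `by_position` are hypothetical names.

```python
import numpy as np
import pandas as pd

data = {
    'birds': ['Cranes', 'Cranes', 'plovers', 'spoonbills', 'spoonbills',
              'Cranes', 'plovers', 'Cranes', 'spoonbills', 'spoonbills'],
    'age': [3.5, 4, 1.5, np.nan, 6, 3, 5.5, np.nan, 8, 4],
    'visits': [2, 4, 3, 4, 3, 4, 2, 2, 3, 2],
    'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no'],
}
bird = pd.DataFrame(data, index=list('abcdefghij'))

# Label-based selection, as in the cell above.
by_label = bird.loc[['b', 'c', 'g'], ['birds', 'age', 'visits']]

# Position-based selection: zero-based row and column positions.
by_position = bird.iloc[[1, 2, 6], [0, 1, 2]]

# Both approaches return the same rows and columns.
print(by_position.equals(by_label))
```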


**8. Select the rows where the number of visits is less than 4**

In [13]:
q6 = bird[bird['visits']<4]
q6

Unnamed: 0,birds,age,visits,priority
a,Cranes,3.5,2,yes
c,plovers,1.5,3,no
e,spoonbills,6.0,3,no
g,plovers,5.5,2,no
h,Cranes,,2,yes
i,spoonbills,8.0,3,no
j,spoonbills,4.0,2,no


**9. Select the rows with columns ['birds', 'visits'] where the age is missing, i.e. NaN**

In [14]:
b1= bird[bird['age'].isnull()]
b1[['birds','visits']]

Unnamed: 0,birds,visits
d,spoonbills,4
h,Cranes,2
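
The row filter and the column selection can also be combined in a single `loc` call, which avoids the intermediate DataFrame. A minimal self-contained sketch, with `missing_age` as a hypothetical name:

```python
import numpy as np
import pandas as pd

data = {
    'birds': ['Cranes', 'Cranes', 'plovers', 'spoonbills', 'spoonbills',
              'Cranes', 'plovers', 'Cranes', 'spoonbills', 'spoonbills'],
    'age': [3.5, 4, 1.5, np.nan, 6, 3, 5.5, np.nan, 8, 4],
    'visits': [2, 4, 3, 4, 3, 4, 2, 2, 3, 2],
    'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no'],
}
bird = pd.DataFrame(data, index=list('abcdefghij'))

# First argument: boolean mask choosing the rows; second: the columns to keep.
missing_age = bird.loc[bird['age'].isna(), ['birds', 'visits']]
print(missing_age)
```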


**10. Select the rows where the bird is 'Cranes' and the age is less than 4**

In [15]:
bird[(bird['birds'] == 'Cranes') & (bird['age'] < 4)]

Unnamed: 0,birds,age,visits,priority
a,Cranes,3.5,2,yes
f,Cranes,3.0,4,no
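
The two conditions in the heading can also be written as a single `query` expression, which some readers find easier to scan than chained boolean masks. This is an alternative sketch, not the notebook's own approach; `young_cranes` is a hypothetical name.

```python
import numpy as np
import pandas as pd

data = {
    'birds': ['Cranes', 'Cranes', 'plovers', 'spoonbills', 'spoonbills',
              'Cranes', 'plovers', 'Cranes', 'spoonbills', 'spoonbills'],
    'age': [3.5, 4, 1.5, np.nan, 6, 3, 5.5, np.nan, 8, 4],
    'visits': [2, 4, 3, 4, 3, 4, 2, 2, 3, 2],
    'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no'],
}
bird = pd.DataFrame(data, index=list('abcdefghij'))

# Column names are used directly inside the expression string.
young_cranes = bird.query("birds == 'Cranes' and age < 4")
print(young_cranes)
```

Rows with a missing age are excluded, because a comparison against NaN evaluates to False.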


**11. Select the rows where the age is between 2 and 4 (inclusive)**

In [16]:
bird[(bird['age'] >= 2) & (bird['age'] <= 4)]

Unnamed: 0,birds,age,visits,priority
a,Cranes,3.5,2,yes
b,Cranes,4.0,4,yes
f,Cranes,3.0,4,no
j,spoonbills,4.0,2,no
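
For an inclusive range like this one, `Series.between` expresses both bounds in one call; it includes both endpoints by default. A minimal self-contained sketch, with `age_2_to_4` as a hypothetical name:

```python
import numpy as np
import pandas as pd

data = {
    'birds': ['Cranes', 'Cranes', 'plovers', 'spoonbills', 'spoonbills',
              'Cranes', 'plovers', 'Cranes', 'spoonbills', 'spoonbills'],
    'age': [3.5, 4, 1.5, np.nan, 6, 3, 5.5, np.nan, 8, 4],
    'visits': [2, 4, 3, 4, 3, 4, 2, 2, 3, 2],
    'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no'],
}
bird = pd.DataFrame(data, index=list('abcdefghij'))

# between(2, 4) is True where 2 <= age <= 4; missing ages give False.
age_2_to_4 = bird[bird['age'].between(2, 4)]
print(age_2_to_4)
```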


**12. Find the total number of visits of the bird Cranes**

In [17]:
q10 = bird[bird['birds']== 'Cranes']
q10['visits'].sum()

12

**13. Calculate the mean age for each type of bird in the DataFrame.**

In [18]:
# Select 'age' before aggregating so the non-numeric 'priority' column is skipped
q11 = bird.groupby('birds')['age'].mean()
q11

birds
Cranes        3.5
plovers       3.5
spoonbills    6.0
Name: age, dtype: float64
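
Steps 12 and 13 can be combined into one grouped summary using named aggregation. The sketch below is an illustration only; the output column names `mean_age` and `total_visits` are hypothetical.

```python
import numpy as np
import pandas as pd

data = {
    'birds': ['Cranes', 'Cranes', 'plovers', 'spoonbills', 'spoonbills',
              'Cranes', 'plovers', 'Cranes', 'spoonbills', 'spoonbills'],
    'age': [3.5, 4, 1.5, np.nan, 6, 3, 5.5, np.nan, 8, 4],
    'visits': [2, 4, 3, 4, 3, 4, 2, 2, 3, 2],
    'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no'],
}
bird = pd.DataFrame(data, index=list('abcdefghij'))

# Each keyword names an output column as (source column, aggregation).
summary = bird.groupby('birds').agg(
    mean_age=('age', 'mean'),
    total_visits=('visits', 'sum'),
)
print(summary)
```

Missing ages are skipped when the mean is computed, so each group's mean is taken over its known ages only.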

**14. Append a new row 'k' to dataframe with your choice of values for each column. Then delete that row to return the original DataFrame.**

In [19]:
# Add row 'k' in place via label-based assignment
bird.loc['k'] = ['Cranes', 2, 3, 'yes']

# drop() returns a new DataFrame, so assign the result back to restore the original
bird = bird.drop('k')
bird

Unnamed: 0,birds,age,visits,priority
a,Cranes,3.5,2,yes
b,Cranes,4.0,4,yes
c,plovers,1.5,3,no
d,spoonbills,,4,yes
e,spoonbills,6.0,3,no
f,Cranes,3.0,4,no
g,plovers,5.5,2,no
h,Cranes,,2,yes
i,spoonbills,8.0,3,no
j,spoonbills,4.0,2,no


**15. Find the number of each type of bird in the DataFrame (counts)**

In [20]:
bird['birds'].value_counts()

Cranes        4
spoonbills    4
plovers       2
Name: birds, dtype: int64

**16. Sort the DataFrame first by the values in 'age' in descending order, then by the values in the 'visits' column in ascending order.**

In [21]:
bird.sort_values(['age', 'visits'], ascending=[False, True])

Unnamed: 0,birds,age,visits,priority
i,spoonbills,8.0,3,no
e,spoonbills,6.0,3,no
g,plovers,5.5,2,no
j,spoonbills,4.0,2,no
b,Cranes,4.0,4,yes
a,Cranes,3.5,2,yes
f,Cranes,3.0,4,no
c,plovers,1.5,3,no
h,Cranes,,2,yes
d,spoonbills,,4,yes
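
By default `sort_values` places rows with a missing age at the end, whatever the sort direction. If those rows should come first instead, the `na_position` parameter controls this. A minimal self-contained sketch, with `nan_first` as a hypothetical name:

```python
import numpy as np
import pandas as pd

data = {
    'birds': ['Cranes', 'Cranes', 'plovers', 'spoonbills', 'spoonbills',
              'Cranes', 'plovers', 'Cranes', 'spoonbills', 'spoonbills'],
    'age': [3.5, 4, 1.5, np.nan, 6, 3, 5.5, np.nan, 8, 4],
    'visits': [2, 4, 3, 4, 3, 4, 2, 2, 3, 2],
    'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no'],
}
bird = pd.DataFrame(data, index=list('abcdefghij'))

# Same sort keys as above, but rows with a missing age are listed first.
nan_first = bird.sort_values(['age', 'visits'], ascending=[False, True],
                             na_position='first')
print(nan_first)
```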


**17. Replace the 'priority' column values: 'yes' becomes 1 and 'no' becomes 0**

In [22]:
# Restrict the replacement to the 'priority' column
bird.replace({'priority': {'yes': 1, 'no': 0}})

Unnamed: 0,birds,age,visits,priority
a,Cranes,3.5,2,1
b,Cranes,4.0,4,1
c,plovers,1.5,3,0
d,spoonbills,,4,1
e,spoonbills,6.0,3,0
f,Cranes,3.0,4,0
g,plovers,5.5,2,0
h,Cranes,,2,1
i,spoonbills,8.0,3,0
j,spoonbills,4.0,2,0
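
An alternative that touches only the `priority` column is to map its values through a dictionary. This sketch is not the notebook's method; `encoded` is a hypothetical name, and `assign` returns a new DataFrame so that `bird` itself is left unchanged.

```python
import numpy as np
import pandas as pd

data = {
    'birds': ['Cranes', 'Cranes', 'plovers', 'spoonbills', 'spoonbills',
              'Cranes', 'plovers', 'Cranes', 'spoonbills', 'spoonbills'],
    'age': [3.5, 4, 1.5, np.nan, 6, 3, 5.5, np.nan, 8, 4],
    'visits': [2, 4, 3, 4, 3, 4, 2, 2, 3, 2],
    'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no'],
}
bird = pd.DataFrame(data, index=list('abcdefghij'))

# map() looks up each value in the dictionary; only 'priority' is affected.
encoded = bird.assign(priority=bird['priority'].map({'yes': 1, 'no': 0}))
print(encoded)
```

Any value missing from the dictionary would become NaN under `map`, which makes unexpected labels easy to spot.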


**18. In the 'birds' column, change the 'Cranes' entries to 'trumpeters'.**

In [23]:
bird.replace(to_replace='Cranes', value ='trumpeters')

Unnamed: 0,birds,age,visits,priority
a,trumpeters,3.5,2,yes
b,trumpeters,4.0,4,yes
c,plovers,1.5,3,no
d,spoonbills,,4,yes
e,spoonbills,6.0,3,no
f,trumpeters,3.0,4,no
g,plovers,5.5,2,no
h,trumpeters,,2,yes
i,spoonbills,8.0,3,no
j,spoonbills,4.0,2,no
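
Note that `replace` returns a new DataFrame and leaves `bird` untouched, which is why the `priority` column above still shows 'yes' and 'no'. To keep a change, assign the result back. The sketch below does this for the `birds` column only, on a freshly built copy of the data:

```python
import numpy as np
import pandas as pd

data = {
    'birds': ['Cranes', 'Cranes', 'plovers', 'spoonbills', 'spoonbills',
              'Cranes', 'plovers', 'Cranes', 'spoonbills', 'spoonbills'],
    'age': [3.5, 4, 1.5, np.nan, 6, 3, 5.5, np.nan, 8, 4],
    'visits': [2, 4, 3, 4, 3, 4, 2, 2, 3, 2],
    'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no'],
}
bird = pd.DataFrame(data, index=list('abcdefghij'))

# Assigning the result back to the column makes the change persistent.
bird['birds'] = bird['birds'].replace('Cranes', 'trumpeters')
print(bird['birds'].value_counts())
```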
