
# **Mastercard Interview Questions**

## **Q1. How will you find all the unique values in a column?**

Call `unique()` on the column; it returns an array of the column's distinct values in order of first appearance. The example below uses the iris dataset:

`df1['sepal length (cm)'].unique()`

In [None]:
import pandas as pd
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

df1 = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                   columns= iris['feature_names'] + ['target'])

"""
np.c_ concatenates its arguments column by column
(it translates slice objects to concatenation along the second axis).
Here it appends the target vector as a fifth column next to the four feature columns.
Example:
np.c_[np.array([1,2,3]), np.array([4,5,6])]
array([[1, 4],
       [2, 5],
       [3, 6]])

"""


In [None]:
df1.columns

Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
       'petal width (cm)', 'target'],
      dtype='object')

In [None]:
df1['sepal length (cm)'].unique()


array([5.1, 4.9, 4.7, 4.6, 5. , 5.4, 4.4, 4.8, 4.3, 5.8, 5.7, 5.2, 5.5,
       4.5, 5.3, 7. , 6.4, 6.9, 6.5, 6.3, 6.6, 5.9, 6. , 6.1, 5.6, 6.7,
       6.2, 6.8, 7.1, 7.6, 7.3, 7.2, 7.7, 7.4, 7.9])
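
As a related sketch on a small made-up frame (the column name `colour` and its values are assumptions for illustration only): `unique()` returns the distinct values, `nunique()` counts them, and `value_counts()` shows how often each one occurs.

```python
import pandas as pd

# Hypothetical data, for illustration only
demo = pd.DataFrame({"colour": ["red", "blue", "red", "green", "blue"]})

distinct = demo["colour"].unique()       # distinct values, in order of first appearance
n_distinct = demo["colour"].nunique()    # number of distinct values (NaN is excluded by default)
counts = demo["colour"].value_counts()   # frequency of each distinct value

print(list(distinct))
print(n_distinct)
print(counts)
```

`value_counts()` is often the more useful call in practice, because it also reveals rare or misspelled categories.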

## **Q2 How will you rename a column?**

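Use the `rename()` method with a mapping from old to new column names. With `inplace=True` the DataFrame is modified directly; without it, `rename()` returns a new DataFrame.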

`df1.rename(columns={'sepal length (cm)' : 'sepal length'}, inplace=True)`

In [None]:
df1.rename(columns={'sepal length (cm)' : 'sepal length'}, inplace=True)
df1.columns

Index(['sepal length', 'sepal width (cm)', 'petal length (cm)',
       'petal width (cm)', 'target'],
      dtype='object')

## **Q3. What are various ways to combine two datasets?**

The **concat()** function in pandas is used to append either columns or rows from one DataFrame to another. The concat() function does all the heavy lifting of performing concatenation operations along an axis while performing optional set logic (union or intersection) of the indexes (if any) on the other axes.


pd.concat(
    objs,
    axis=0,
    join="outer",
    ignore_index=False,
    keys=None,
    levels=None,
    names=None,
    verify_integrity=False,
    copy=True,
)









In [None]:
# First DataFrame
df1 = pd.DataFrame({'id': ['A01', 'A02', 'A03', 'A04'],
                    'Name': ['ABC', 'PQR', 'DEF', 'GHI']},
                   index=[0, 1, 2, 3])
# Second DataFrame
df2 = pd.DataFrame({'id': ['B05', 'B06', 'B07', 'B08'],
                    'Name': ['XYZ', 'TUV', 'MNO', 'JKL']},
                   index=[4, 5, 6, 7])
  
frames = [df1, df2]
result = pd.concat(frames)
display(result)

    id Name
0  A01  ABC
1  A02  PQR
2  A03  DEF
3  A04  GHI
4  B05  XYZ
5  B06  TUV
6  B07  MNO
7  B08  JKL


In [None]:
## Joining
import pandas as pd
  
df1 = pd.DataFrame({'id': ['A01', 'A02', 'A03', 'A04'],
                    'Name': ['ABC', 'PQR', 'DEF', 'GHI']})
  
df3 = pd.DataFrame({'City': ['MUMBAI', 'PUNE', 'MUMBAI', 'DELHI'],
                    'Age': ['12', '13', '14', '12']})
  
# join='outer' is the default; join='inner' keeps only the
# index labels that are present in both frames
  
result = pd.concat([df1, df3], axis=1, join='inner')
display(result)

    id Name    City Age
0  A01  ABC  MUMBAI  12
1  A02  PQR    PUNE  13
2  A03  DEF  MUMBAI  14
3  A04  GHI   DELHI  12
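
`concat()` lines frames up by position or index. When two datasets share a key column, `merge()` (a SQL-style join) is the other common way to combine them. The following is a minimal sketch; the frames `left` and `right` and their values are made up for illustration.

```python
import pandas as pd

# Hypothetical frames that share the key column "id"
left = pd.DataFrame({"id": ["A01", "A02", "A03"],
                     "Name": ["ABC", "PQR", "DEF"]})
right = pd.DataFrame({"id": ["A02", "A03", "A04"],
                      "City": ["PUNE", "MUMBAI", "DELHI"]})

# Inner join: only ids present in both frames
inner = pd.merge(left, right, on="id", how="inner")

# Outer join: every id from either frame, with NaN where a side has no match
outer = pd.merge(left, right, on="id", how="outer")

print(inner)
print(outer)
```

`how` also accepts `"left"` and `"right"` when the rows of one frame must all be kept.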


## **Q4. How will you load a dataset which is too large in size to hold in memory?**

https://towardsdatascience.com/what-to-do-when-your-data-is-too-big-for-your-memory-65c84c600585

### Technique 1: Lossless Compression
#### Load data by columns


In [None]:
#Import needed library
import pandas as pd
#Dataset
csv = "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"
#Load entire dataset
data = pd.read_csv(csv)
data.info(verbose=False, memory_usage="deep")
print()
#Load only two columns
df_2col = pd.read_csv(csv , usecols=["county", "cases"])
df_2col.info(verbose=False, memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2502832 entries, 0 to 2502831
Columns: 6 entries, date to deaths
dtypes: float64(2), int64(1), object(3)
memory usage: 526.0 MB

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2502832 entries, 0 to 2502831
Columns: 2 entries, county to cases
dtypes: int64(1), object(1)
memory usage: 172.3 MB


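### Technique 2: Manipulate datatypes
By default pandas reads integer columns as int64. If every value in a column fits in a narrower type, declaring that type at load time saves memory.

int8 can store integers from -128 to 127.

int16 can store integers from -32768 to 32767.

int32 can store integers from -2147483648 to 2147483647.

int64 can store integers from -9223372036854775808 to 9223372036854775807.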

In [None]:
df = pd.read_csv(csv, usecols=["county", "cases"])
df["cases"].memory_usage(index=False, deep=True)

20022656

In [None]:
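# The largest value in "cases" (2908425, see the chunking example below) does not
# fit in int16, so int32 is the narrowest integer type that is safe here.
df = pd.read_csv(csv, usecols=["county", "cases"], dtype={"cases" : "int32"})
df["cases"].memory_usage(index=False, deep=True)

## int32 uses 4 bytes per value instead of 8, so the column takes half the memory

10011328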
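
When the value range is not known up front, one option is to let pandas choose the narrowest integer type that can hold every value, using `pd.to_numeric(..., downcast="integer")`. This is a minimal sketch on made-up numbers; the Series `cases` below is hypothetical and is not the CSV column used above.

```python
import pandas as pd

# Hypothetical values; the largest one does not fit in int16
cases = pd.Series([0, 150, 32000, 2908425], dtype="int64")

# downcast="integer" picks the smallest signed integer dtype that holds all values
small = pd.to_numeric(cases, downcast="integer")

print(small.dtype)
print(cases.memory_usage(index=False), small.memory_usage(index=False))
```

Because the dtype is chosen from the data, no value is truncated, which is the risk when a narrow dtype is hard-coded.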

### Technique 3: Chunking
Chunking means cutting a large dataset into smaller chunks and processing those chunks one at a time, so only one chunk has to be in memory at once. After all the chunks have been processed, you combine the per-chunk results to get the final answer.

In [None]:
import pandas as pd
#Dataset 
csv = "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"
#Loop over the chunks and record the maximum of each one
result = {}
for chunk in pd.read_csv(csv, chunksize=100):
    max_case = chunk["cases"].max()
    max_case_county = chunk.loc[chunk['cases'] == max_case, 'county'].iloc[0]
    # Keep the largest value seen so far for this county
    result[max_case_county] = max(result.get(max_case_county, 0), max_case)
#Display the county with the overall maximum
top_county = max(result, key=result.get)
print(top_county, result[top_county])

Los Angeles 2908425


### Technique 4: Indexing
The chunking code above took about 2 minutes to execute.

Chunking works well if you only need to pass over the dataset once, but if you need to pull out different subsets repeatedly, indexing is the way to go.

For example, let’s say I want to get the cases for a specific state. With chunking alone, I could write a simple function that does that:

In [None]:
def get_state_info(name):
  csv = "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"
  return pd.concat(df[df['state']==name] for df in 
                   pd.read_csv(csv, chunksize=100)
                   )

So, this small function loads every row of every chunk but keeps only the rows for the requested state. **That leads to significant overhead**, because the whole file is read again on every call. I can avoid this by using a database alongside pandas. The simplest one I can use is **SQLite.**

I first load the data into SQLite, then rewrite the get_state_info function to query it.

In [None]:
## I first need to load my data frame into an SQLite database.
import sqlite3
csv = "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"
# Create a new database file:
db = sqlite3.connect("cases.sqlite")
# Load the CSV in chunks:
for c in pd.read_csv(csv, chunksize=100):
    # Append all rows to a new database table
    c.to_sql("cases", db, if_exists="append")
# Add an index on the 'state' column:
db.execute("CREATE INDEX state ON cases(state)") 
db.close()

In [None]:
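def get_state_info(name):
  connection = sqlite3.connect("cases.sqlite")
  query = "select * from cases where state = ?"
  # params must be a sequence, so the single value goes in a one-element tuple
  values = (name,)
  try:
    # pass the values via params=; the third positional argument is index_col
    return pd.read_sql_query(query, connection, params=values)
  finally:
    connection.close()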

## **Q5. How will you create a new column whose value is calculated from two other columns?**

1. Create a new column by assigning the result to the DataFrame, with the new column name inside the `[]`.

2. Operations are element-wise, no need to loop over rows.

3. Use rename with a dictionary or function to rename row labels or column names.


In [None]:
df = pd.DataFrame({'Date':['10/2/2011', '11/2/2011', '12/2/2011', '13/2/2011'],
                    'Event':['Music', 'Poetry', 'Theatre', 'Comedy'],
                    'Cost':[10000, 5000, 15000, 2000]})
  
# Print the dataframe
print(df)

        Date    Event   Cost
0  10/2/2011    Music  10000
1  11/2/2011   Poetry   5000
2  12/2/2011  Theatre  15000
3  13/2/2011   Comedy   2000


In [None]:
df["Discounted Price"] = df.apply(lambda row: row.Cost - (row.Cost*0.1), axis=1)
print(df)

        Date    Event   Cost  Discounted Price
0  10/2/2011    Music  10000            9000.0
1  11/2/2011   Poetry   5000            4500.0
2  12/2/2011  Theatre  15000           13500.0
3  13/2/2011   Comedy   2000            1800.0
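
Because column arithmetic is element-wise, the same discount can be computed without `apply`, and a column derived from two other columns follows the same pattern. This is a minimal sketch that rebuilds a small events table; the frame name `events` and the `Tickets` column are hypothetical additions for illustration.

```python
import pandas as pd

events = pd.DataFrame({"Event": ["Music", "Poetry", "Theatre", "Comedy"],
                       "Cost": [10000, 5000, 15000, 2000],
                       "Tickets": [3, 2, 1, 4]})  # "Tickets" is a hypothetical column

# Same 10% discount as above, computed on the whole column at once
events["Discounted Price"] = events["Cost"] - events["Cost"] * 0.1

# A new column calculated from two other columns
events["Total"] = events["Discounted Price"] * events["Tickets"]

print(events)
```

The vectorized form is usually much faster than `apply(..., axis=1)` on large frames, since it avoids calling a Python function once per row.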


## **Q6. What is the difference between loc() and iloc()?**


`ipcdf=ipcdf.iloc[ipcdf['ACTION_TIME'].dropna().index,:]`

`ipcdf.loc[ipcdf['CATEGORY'].isin(['Grp-Weight']),'CATEGORY']='Group Weight'`

**loc()** is a **label-based** selection method, which means that we pass the labels of the rows or columns we want to select. Although written here as loc(), it is an indexer used with square brackets, as in `data.loc[...]`. A label slice includes its end label, unlike iloc(). loc() also accepts a boolean Series or array as a mask. Common operations with loc() include:

1. Selecting data according to some conditions :

`display(data.loc[(data.Brand == 'Maruti') & (data.Mileage > 25)])`

2. Selecting a range of rows from the DataFrame :

`display(data.loc[2 : 5])`

3. Updating the value of any column :

`data.loc[(data.Year < 2015), ['Mileage']] = 22`


**iloc() :** iloc() is an **integer-position-based** selection method, which means that we pass integer positions to select specific rows or columns. A slice excludes its end position, unlike loc(). iloc() accepts a boolean array as a mask but not a boolean Series, so a mask built from a column must be converted first (for example with `.to_numpy()`). Operations performed using iloc() include:

1. Selecting rows using integer indices:

`display(data.iloc[[0, 2, 4, 7]])`

2. Selecting a range of columns and rows simultaneously:

`display(data.iloc[1 : 5, 2 : 5])`
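
The difference is easiest to see on a frame whose index labels are not 0, 1, 2, and so on. The sketch below uses a made-up `cars` frame (its name, columns, and values are assumptions for illustration).

```python
import pandas as pd

# Hypothetical data with non-default index labels
cars = pd.DataFrame({"Brand": ["Maruti", "Honda", "Maruti", "Tata"],
                     "Mileage": [28, 20, 24, 26]},
                    index=[10, 20, 30, 40])

by_label = cars.loc[10:30]      # labels 10, 20 and 30: the end label is included
by_position = cars.iloc[0:2]    # positions 0 and 1: the end position is excluded

mask = cars["Mileage"] > 25
with_loc = cars.loc[mask]                 # a boolean Series works with loc
with_iloc = cars.iloc[mask.to_numpy()]    # iloc needs a plain boolean array

print(by_label)
print(by_position)
print(with_loc)
```

Both masked selections return the same rows; only the form of the mask differs.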



## **Q7. How will you calculate the total number of null values in a dataset?**

**Counting NaN in a column :**

`data['B'].isnull().sum()`

In [None]:

import pandas as pd
import numpy as np
    
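# dictionary of lists (named values so the built-in dict is not shadowed)
# np.nan is used because the np.NaN alias was removed in NumPy 2.0
values = { 'A':[1, 4, 6, 9], 
        'B':[np.nan, 5, 8, np.nan], 
        'C':[7, 3, np.nan, 2],
        'D':[1, np.nan, np.nan, np.nan] } 
    
# creating dataframe from the
# dictionary 
data = pd.DataFrame(values) 
print(data.head())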

# total NaN values in column 'B'
print()
print('total NaN values in col B =',data['B'].isnull().sum())

   A    B    C    D
0  1  NaN  7.0  1.0
1  4  5.0  3.0  NaN
2  6  8.0  NaN  NaN
3  9  NaN  2.0  NaN

total NaN values in col B = 2


**Counting NaN in a row :**

The row can be selected using loc or iloc. Then we find the sum as before.



In [None]:
display(data.loc[1, :].isnull().sum())
display(data.loc[2, :].isnull().sum())

1

2
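
For the total across the whole dataset, the per-column counts can be summed once more. This is a minimal sketch that rebuilds the same small frame so that it runs on its own.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"A": [1, 4, 6, 9],
                     "B": [np.nan, 5, 8, np.nan],
                     "C": [7, 3, np.nan, 2],
                     "D": [1, np.nan, np.nan, np.nan]})

per_column = data.isnull().sum()          # NaN count for each column
per_row = data.isnull().sum(axis=1)       # NaN count for each row
total = int(data.isnull().sum().sum())    # NaN count for the whole frame

print(per_column)
print(per_row)
print("total NaN values =", total)
```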

## **Q8. In a dataset, there is a sex column with the following unique values – Male, Female, M, F, other. How will you change these values to make this column consistent with a supervised model?**

`df.loc[df[<some_column_name>] == <condition>, [<another_column_name>]] = <value_to_add>`

where **some_column_name** is the column that the **condition** is checked against, and **another_column_name** is the column to write to (a new column, an existing one, or the same column whose values you want to replace). **value_to_add** is the value assigned to that column in the matching rows.

`df.loc[df['Sex'].isin(['M']),'Sex']='Male'`

The code above replaces M with Male; the same pattern with 'F' and 'Female' handles the other abbreviation.
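
All the variants can also be normalized in one step with a mapping, and the cleaned column can then be encoded as numbers for a model. This is a minimal sketch: the frame `people`, the `canonical` mapping, and the choice to spell the third category as `Other` are assumptions, and one-hot encoding is only one of several possible encodings.

```python
import pandas as pd

# Hypothetical column containing the inconsistent labels from the question
people = pd.DataFrame({"Sex": ["Male", "Female", "M", "F", "other", "F"]})

# Map each variant to one canonical label; values not in the mapping are left as they are
canonical = {"M": "Male", "F": "Female", "other": "Other"}
people["Sex"] = people["Sex"].replace(canonical)

# One-hot encode the cleaned column so a model can consume it
sex_dummies = pd.get_dummies(people["Sex"], prefix="Sex")

print(people["Sex"].unique())
print(sex_dummies)
```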

## **Q9. In a dataset with an age column, extract all those rows with ages between 18 and 60?**

1. Technique 1 - using condition directly
2. Technique 2 - using isin() function

In [None]:
import pandas as pd
  
record = {
  
 'Name': ['Ankit', 'Amit', 'Aishwarya', 'Priyanka', 'Priya', 'Shaurya' ],
 'Age': [21, 19, 20, 18, 17, 21],
 'Stream': ['Math', 'Commerce', 'Science', 'Math', 'Math', 'Science'],
 'Percentage': [88, 92, 95, 70, 65, 78] }
  
# create a dataframe
dataframe = pd.DataFrame(record, columns = ['Name', 'Age', 'Stream', 'Percentage'])

dataframe.head()

        Name  Age    Stream  Percentage
0      Ankit   21      Math          88
1       Amit   19  Commerce          92
2  Aishwarya   20   Science          95
3   Priyanka   18      Math          70
4      Priya   17      Math          65


In [None]:
df = dataframe[dataframe["Age"] < 20]
df

       Name  Age    Stream  Percentage
1      Amit   19  Commerce          92
3  Priyanka   18      Math          70
4     Priya   17      Math          65


In [None]:
options = ['Math', 'Commerce']
# Keep the filtered rows; loc returns a new DataFrame and leaves the original unchanged
filtered = dataframe.loc[dataframe["Stream"].isin(options)]
filtered

       Name  Age    Stream  Percentage
0     Ankit   21      Math          88
1      Amit   19  Commerce          92
3  Priyanka   18      Math          70
4     Priya   17      Math          65
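
For the range asked about in the question, either two conditions can be combined or `Series.between()` can be used, which includes both endpoints by default. This is a minimal sketch on a made-up frame; the name `people` and the ages are chosen only to show the boundary cases.

```python
import pandas as pd

# Hypothetical ages, including values on and outside the boundaries
people = pd.DataFrame({"Name": ["Ankit", "Amit", "Priya", "Shaurya"],
                       "Age": [21, 61, 17, 60]})

# Technique 1: combine two conditions (the parentheses are required around each one)
by_condition = people[(people["Age"] >= 18) & (people["Age"] <= 60)]

# Technique 2: between() keeps ages from 18 to 60, both endpoints included
by_between = people[people["Age"].between(18, 60)]

print(by_between)
```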


## **Q10. What is the use of .append() and .extend() on a list?**

Python’s **append()** method adds a **single element** to the **end of an existing list**. The list is modified in place rather than a new list being returned (the method returns `None`), and the **length of the list increases by one**. If the argument is itself a list, it is added as one nested element.

In [None]:
my_list = ['geeks', 'for']
my_list.append('geeks')
print(my_list)

another_list = [6, 4, 7, 1]
my_list.append(another_list)
print(my_list)

['geeks', 'for', 'geeks']
['geeks', 'for', 'geeks', [6, 4, 7, 1]]


**extend()** iterates over its argument and adds each element to the end of the list. The **length** of the list **increases** by the **number of elements** in its argument.

Time complexity: append() takes amortized constant time, O(1). extend() takes O(k), where k is the number of elements being added.



In [None]:
my_list = ['geeks', 'for']
another_list = [6, 0, 4, 1]
my_list.extend(another_list)
print(my_list)

my_list = ['geeks', 'for', 6, 0, 4, 1]
my_list.extend('geeks')
print(my_list)

['geeks', 'for', 6, 0, 4, 1]
['geeks', 'for', 6, 0, 4, 1, 'g', 'e', 'e', 'k', 's']


## **Q11. Which technique did you use to fill the null values and why?**

1. Deleting rows with missing values
2. Impute missing values for continuous variable
3. Impute missing values for categorical variable
4. Other imputation methods
5. Using algorithms that support missing values
6. Prediction of missing values
7. Imputation using Deep Learning Library (Datawig)

https://towardsdatascience.com/7-ways-to-handle-missing-values-in-machine-learning-1a6326adf79e
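
The first three options in the list can be sketched with plain pandas. The frame `df_missing` and its values are made up for illustration, and the median and the mode are only one reasonable choice of fill value for each kind of column.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in a continuous and a categorical column
df_missing = pd.DataFrame({"Age": [21.0, np.nan, 20.0, 18.0, np.nan],
                           "Stream": ["Math", "Math", None, "Science", "Math"]})

# 1. Delete the rows that contain any missing value
dropped = df_missing.dropna()

imputed = df_missing.copy()
# 2. Continuous variable: fill with the median, which is robust to outliers
imputed["Age"] = imputed["Age"].fillna(imputed["Age"].median())
# 3. Categorical variable: fill with the most frequent value (the mode)
imputed["Stream"] = imputed["Stream"].fillna(imputed["Stream"].mode()[0])

print(dropped)
print(imputed)
```

Dropping rows is the simplest option but discards data, so imputation is usually preferred when a large share of rows has at least one gap.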
    