# Intro to Data Wrangling
* Data wrangling (also known as data munging) refers to the process of cleaning, transforming, and preparing raw data for analysis or machine learning tasks.
* It involves tasks such as handling missing values, correcting errors in the data, converting data formats, aggregating data, and ensuring the data is in a usable structure.

__NOTE:__
- __This is an introductory session. We will dive deep into data analysis with real data after midterm exam 1.__
- __After midterm 1, the course shifts to applications with real-world data analysis, databases, and web development (the topics may be intertwined).__
- __These concepts will not be part of midterm exam 1.__


# Overview of Data Wrangling
* Data wrangling is a crucial step in any data analysis process. Raw data often comes from various sources such as CSV files, databases, APIs, and web scraping.
* These datasets often contain missing or erroneous data, inconsistencies, and unstructured formats.
* The goal of data wrangling is to convert this raw data into a clean, structured format that can be used effectively for analysis or machine learning.

It includes tasks like:
- Cleaning: Removing errors or inconsistencies.
- Transforming: Changing data types, values, or structures.
- Consolidating: Merging different data sources.
- Filtering: Selecting relevant data points.
- Feature Engineering: Creating new features from existing data.

# Use cases
Why is it Important?
- Data collected from various sources such as web scraping, APIs, and databases often come in raw forms with inconsistencies.
- Cleaning and transforming this data is necessary for obtaining actionable insights.
- You cannot perform meaningful analysis, derive conclusions, or build machine learning models without this step.


# Python libraries for Data Wrangling

There are many Python libraries that can be used for this purpose. We will look at some of the popular ones, as well as some niche libraries that are useful for specific types of data.



### Pandas: The Core Library
Pandas provides two main data structures:
- DataFrame: A two-dimensional table with labeled axes (rows and columns).
- Series: A one-dimensional array with labeled indices.

Pandas simplifies data manipulation and preparation through a rich set of functions.

Basic Operations:
* pd.read_csv() to read data from CSV files.
* DataFrame.dropna() to remove missing data.
* DataFrame.fillna() to replace missing values.
* DataFrame.groupby() for aggregation.
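
As a rough illustration of the four operations listed above, here is a minimal sketch. The CSV text, column names, and values are made up for this example, and `io.StringIO` stands in for a real file on disk.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a file on disk.
csv_text = """name,dept,salary
Ann,HR,50000
Bob,IT,
Cy,IT,70000
"""

# read_csv parses the empty salary field for Bob as NaN.
df = pd.read_csv(io.StringIO(csv_text))

# dropna removes every row that has at least one missing value.
dropped = df.dropna()

# fillna replaces missing values; here only in the salary column.
filled = df.fillna({"salary": 0})

# groupby + mean aggregates the salary per department.
mean_by_dept = filled.groupby("dept")["salary"].mean()

print(dropped)
print(filled)
print(mean_by_dept)
```

Note how the choice of fill value matters: filling Bob's salary with 0 pulls the IT average down, which is one reason imputation strategies deserve some thought.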


### NumPy: Handling Numerical Data
NumPy is used for numerical computing and is especially useful for working with arrays. It supports mathematical calculations, broadcasting, and reshaping data.

Key Functions:
* np.array() to create NumPy arrays.
* np.mean(), np.median() for basic statistics (among many others).
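
A small sketch of the ideas mentioned above (array creation, basic statistics, broadcasting, and reshaping). The numbers are arbitrary and chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

mean_value = np.mean(values)      # 3.5
median_value = np.median(values)  # 3.5

# Broadcasting: the scalar 10 is applied to every element.
scaled = values * 10

# Reshaping: view the same six numbers as 2 rows by 3 columns.
grid = values.reshape(2, 3)

# Broadcasting a 1-D array of length 3 across each row of the 2-D grid.
shifted = grid + np.array([100.0, 200.0, 300.0])

print(mean_value, median_value)
print(scaled)
print(shifted)
```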

### Matplotlib & Seaborn: Data Visualization
Data wrangling is often followed by visualizing the results to ensure data correctness.
* Matplotlib: Provides low-level plotting functionalities.
* Seaborn: Built on top of Matplotlib and offers advanced statistical visualizations.

### Regex (re module, already covered)
* For string manipulation, regular expressions (Regex) allow you to match and extract patterns from text data. For example, removing extra spaces, validating email addresses, or extracting dates.
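
The three uses mentioned above could look roughly like this. The sample strings are made up, and the email pattern is a deliberately simple assumption for illustration, not a complete validator.

```python
import re

raw = "  Jane   Smith  joined on 2020-07-15  "

# Removing extra spaces: collapse runs of whitespace, then trim the ends.
collapsed = re.sub(r"\s+", " ", raw).strip()

# Extracting dates written as YYYY-MM-DD.
dates = re.findall(r"\d{4}-\d{2}-\d{2}", raw)

# Simplified email check (assumption): something@domain.tld
email_pattern = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")

def looks_like_email(text):
    """Return True if text matches the simplified email pattern."""
    return email_pattern.match(text) is not None

print(collapsed)
print(dates)
print(looks_like_email("john.doe@gmail.com"))
print(looks_like_email("emily.davis@company"))
```

The second address has no dot after the domain name, so the simplified pattern rejects it.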




# A Simple Introduction to pandas for Data Wrangling Tasks

We will use messy CSV data files as examples. CSV stands for Comma-Separated Values. Each row represents a data point (also called a data sample or entry), and each column represents an attribute or feature.

The first row usually holds the column names (this is optional, but almost always the case). Pandas can use the first row as the column labels.

Let's walk through some basic data wrangling tasks.

### Handling Missing Data

There are three main steps when dealing with missing data: identifying it, removing it, and imputing it.

#### Identifying Missing Data:

In [None]:
import pandas as pd

data = pd.read_csv('messy_employee_data.csv')
# The data has 10 rows and 12 columns
# data.info() also reports this

print(data.shape)
print("_________________")
print(data)
print("_________________")
print(data.isnull().sum())  # Shows the count of missing values per column



(10, 12)
_________________
   Employee_ID First_Name Last_Name  Gender   Age Department  \
0          101       John       Doe    Male  35.0         HR   
1          102       Jane     SMITH  female  28.0         IT   
2          103        Sam     Brown    Male  45.0    Finance   
3          104      Emily     Davis  Female  30.0         IT   
4          105    Michael   Johnson    Male   NaN         HR   
5          106      Laura    Wilson  Female  40.0  Marketing   
6          107     Daniel     Moore    male  33.0  Marketing   
7          108        NaN     Black    male  27.0         HR   
8          109   Patricia     Green  Female  55.0         HR   
9          101       John       Doe    Male  35.0         HR   

            Job_Title   Salary Date_Joined    Bonus Phone_Number  \
0          HR Manager  65000.0  2018-03-01   5000.0     555-1234   
1   Software Engineer  85000.0  2020-07-15   7000.0     555-5678   
2          Accountant  90000.0  2015-06-23   5500.0     555-2345

#### Removing Missing Data:
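There are multiple ways to remove missing data. Removal is rarely used in practice, because it also discards the valid values in every dropped row or column.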

In [None]:
#Remove rows with missing data:
data_cleaned1 = data.dropna()
#Drop columns with missing data:
data_cleaned2 = data.dropna(axis=1) #axis 0 is rows (default), axis 1 is columns
print("data_cleaned1")
print(data_cleaned1)
print("____________")
print("data_cleaned2")
print(data_cleaned2)

# The dropna method in pandas DataFrames includes a thresh parameter that specifies the minimum number of non-missing values a row or column must have to be kept. Rows or columns with fewer non-missing values than the threshold are removed.

print("____________")
df_thresh_10_row = data.dropna(thresh=10)
print("\nDataFrame after dropping rows with less than 10 non-NaN values:")
print("data_thresh_10")
print(df_thresh_10_row)

#If a row has at least 10 filled (non-NaN) values, it will be kept.
#If a row has less than 10 filled values, it will be dropped.


print("____________")
df_thresh_10_col = data.dropna(thresh=10, axis=1)
print("data_thresh_10 axis 1")
print("\nDataFrame after dropping columns with less than 10 non-NaN values:")
print(df_thresh_10_col)




data_cleaned1
   Employee_ID First_Name Last_Name  Gender   Age Department  \
0          101       John       Doe    Male  35.0         HR   
1          102       Jane     SMITH  female  28.0         IT   
2          103        Sam     Brown    Male  45.0    Finance   
3          104      Emily     Davis  Female  30.0         IT   
6          107     Daniel     Moore    male  33.0  Marketing   
8          109   Patricia     Green  Female  55.0         HR   
9          101       John       Doe    Male  35.0         HR   

            Job_Title   Salary Date_Joined    Bonus Phone_Number  \
0          HR Manager  65000.0  2018-03-01   5000.0     555-1234   
1   Software Engineer  85000.0  2020-07-15   7000.0     555-5678   
2          Accountant  90000.0  2015-06-23   5500.0     555-2345   
3      Data Scientist  95000.0  2019-11-12   8000.0     555-6789   
6  Content Strategist  70000.0  2021-01-22   4000.0     555-7890   
8          HR Manager  80000.0  2016-08-19  10000.0     555-1234 

#### Imputing Missing Data:

There are multiple ways to impute (fill in) missing data. This is the most commonly used, go-to approach.
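
As a self-contained sketch of imputation, the example below fills a text column with a default value and numeric columns with their mean or median. The dataframe and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column.
df = pd.DataFrame({
    "name": ["Ann", None, "Cy", "Di"],
    "age": [30.0, np.nan, 50.0, 40.0],
    "salary": [100.0, 200.0, np.nan, 600.0],
})

# Step 1: fill the text column with a default value.
step1 = df.fillna({"name": "Unknown"})

# Step 2: fill each numeric column with its own mean (or median).
numeric_cols = step1.select_dtypes(include="number").columns
mean_filled = step1.fillna(step1[numeric_cols].mean())
median_filled = step1.fillna(step1[numeric_cols].median())

print(mean_filled)
print(median_filled)
```

The missing salary becomes 300.0 with the mean but 200.0 with the median, which shows how a single large value (600.0) shifts the mean.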


In [None]:

#NOTE: if we set inplace=True for any of the following methods, it updates the original dataframe instead of returning a new one.
#Fill missing values with a default value or the mean/median:
#data_filled_def=data[['Salary', 'Bonus', 'Age']].fillna(0) #filling all numeric with specific
#data_filled_def = data.fillna(0) #filling default value
#print(data_filled_def)

#Generally applied first for string columns.
#print(data.dtypes) # prints the data types for each column. We will look into setting and changing data types of columns later.
data_filled_def = data.fillna({'First_Name': 'Unknown','Date_Joined': 'Unknown','Phone_Number': 'Unknown'}) # filling specific columns with default values
#print(data_filled_def)

#One trick to get the names of all numeric (non-object) columns:
# numeric_cols = data_filled_def.select_dtypes(exclude='object').columns
# #If the dataframe is completely numeric, we can just use data.mean().
# #Filling with mean values only works with numerical columns.
# data_filled_mean = data_filled_def.fillna(data_filled_def[numeric_cols].mean())
# #print("_______________")
# print(data_filled_mean)

#Forward or Backward Filling:
#data_ffill = data.ffill() #forward fill: each missing value takes the PRECEDING (previous row's) value, propagating from top to bottom
#print("_______________")
#print(data_ffill)
#
data_bfill = data.bfill() #backward fill: each missing value takes the SUCCEEDING (next row's) value, propagating from bottom to top
print("_______________")
print(data_bfill)


_______________
   Employee_ID First_Name Last_Name  Gender   Age Department  \
0          101       John       Doe    Male  35.0         HR   
1          102       Jane     SMITH  female  28.0         IT   
2          103        Sam     Brown    Male  45.0    Finance   
3          104      Emily     Davis  Female  30.0         IT   
4          105    Michael   Johnson    Male  40.0         HR   
5          106      Laura    Wilson  Female  40.0  Marketing   
6          107     Daniel     Moore    male  33.0  Marketing   
7          108   Patricia     Black    male  27.0         HR   
8          109   Patricia     Green  Female  55.0         HR   
9          101       John       Doe    Male  35.0         HR   

            Job_Title   Salary Date_Joined    Bonus Phone_Number  \
0          HR Manager  65000.0  2018-03-01   5000.0     555-1234   
1   Software Engineer  85000.0  2020-07-15   7000.0     555-5678   
2          Accountant  90000.0  2015-06-23   5500.0     555-2345   
3      

### Removing Duplicates

Duplicate data often distorts the results, especially in large datasets.

#### Identifying Duplicates:



In [None]:
print(data.duplicated())  # Returns True for duplicate rows (here the last row, index 9, is a duplicate of the first row, index 0)

0    False
1    False
2    False
3    False
4    False
5    False
6    False
7    False
8    False
9     True
dtype: bool


#### Removing Duplicates:

In [None]:
print(data)
#Remove exact duplicate rows:
# data_cleaned1 = data.drop_duplicates()
# print(data_cleaned1)
#Remove duplicates based on specific columns:
# data_cleaned2 = data.drop_duplicates(subset=['Salary']) # even if the other columns differ, rows whose Salary matches an earlier row are dropped (only the first occurrence is kept)
# print(data_cleaned2)


Before we move on...
* read_csv can also read other delimited text files if you specify the separator. Basic syntax:  pd.read_csv("\<filename.extension\>", sep="\<separator\>")
* You can also specify the data types (dtype) of the columns and the strings to treat as NaN (na_values) during the read call.
* Refer to Biological data exploration with Python, pandas and seaborn, pages 14 onwards.
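
A minimal sketch of these read_csv options, assuming a semicolon-separated file in which `?` marks a missing value. The data is made up, and `io.StringIO` stands in for a file path.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated data; "?" marks a missing temperature.
text = """id;city;temp
001;Oslo;4.5
002;Lima;?
003;Cairo;30.1
"""

df = pd.read_csv(
    io.StringIO(text),
    sep=";",             # custom separator
    dtype={"id": str},   # keep ids as text so the leading zeros survive
    na_values="?",       # treat "?" as NaN
)

print(df)
print(df.dtypes)
```

Without `dtype={"id": str}`, pandas would read the ids as integers and `001` would become `1`.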


* Test these out:
dataframe.info() can give you:
    - what class the object is (a DataFrame )
    - what the index looks like (a range )
    - how many data columns we have
    - for each column, how many values and their dtype
    - a summary of how many columns have each dtype

dataframe.describe() gives you basic statistics (count, mean, standard deviation, min, quartiles, max) for each numeric column.
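
To try both calls without the course CSV, a tiny made-up dataframe is enough. This is only a sketch; the column names and values are assumptions.

```python
import pandas as pd

# Hypothetical data, made up for this sketch.
people = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy", "Di"],
    "age": [30, 40, 50, 60],
})

# Prints the class, index range, columns, non-null counts, and dtypes.
people.info()

# By default, describe() summarizes only the numeric columns.
summary = people.describe()
print(summary)
```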


* You can also specify your own column names, either while reading the file or afterwards.

In [None]:
# Example: using your own column names

my_columns=["col1", "col2", "col3", "col4", "col5", "col6", "col7", "col8", "col9", "col10","col11","col12"]
# Note: with names=..., the file's header row is read as data (see row 0 of the output).
# Pass header=0 as well if you want the new names to replace the header row instead.
example1=pd.read_csv('messy_employee_data.csv',names=my_columns, na_values='?')
print(example1.head())

# Note: assigning to .columns after reading replaces the existing column names.
example2=pd.read_csv('messy_employee_data.csv')
example2.columns = my_columns
print(example2.head())

          col1        col2       col3    col4  col5        col6  \
0  Employee_ID  First_Name  Last_Name  Gender   Age  Department   
1          101        John        Doe    Male  35.0          HR   
2          102        Jane      SMITH  female  28.0          IT   
3          103         Sam      Brown    Male  45.0     Finance   
4          104       Emily      Davis  Female  30.0          IT   

                col7     col8         col9   col10         col11  \
0          Job_Title   Salary  Date_Joined   Bonus  Phone_Number   
1         HR Manager  65000.0   2018-03-01  5000.0      555-1234   
2  Software Engineer  85000.0   2020-07-15  7000.0      555-5678   
3         Accountant  90000.0   2015-06-23  5500.0      555-2345   
4     Data Scientist  95000.0   2019-11-12  8000.0      555-6789   

                    col12  
0                   Email  
1      john.doe@gmail.com  
2  jane.smith@example.com  
3   sam.brown@company.com  
4     emily.davis@company  
   col1     col2    


# Series

* You can build a dataframe from Series objects.
* Series objects are one-dimensional labeled arrays, and pandas provides a rich set of functionality for them.
* Broadcasting operations are a very useful tool.
* Each column in a dataframe is a Series.


In [None]:
import pandas as pd
# Each Series below can be used on its own, or several can be combined as the columns of a DataFrame

# Building a dataframe from Series objects (b is shorter than a, so its missing rows become NaN)
print("_______________")
a = pd.Series([1,2,3,4,5,6,7,8,9])
b=pd.Series(["One","Two","Three","Four","Five","Six","Seven"])
c=pd.Series([0,1,0,1,0,1,0,1,0,1])
newdata=pd.DataFrame({"column1":a,"column2":b})

print(newdata)
print("_______________")

# You can additionally specify the data type with the dtype parameter of pd.Series

#broadcasting
print(a * 2)
print("_______________")
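# Series are multiplied element-wise by index label; c has an extra label (9), so the product contains a NaN and becomes float
newdata['c']=newdata['column1'] * c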
print(newdata)
print(type(newdata['column1']))

print("_______________")
print(a**2)

_______________
   column1 column2
0        1     One
1        2     Two
2        3   Three
3        4    Four
4        5    Five
5        6     Six
6        7   Seven
7        8     NaN
8        9     NaN
_______________
0     2
1     4
2     6
3     8
4    10
5    12
6    14
7    16
8    18
dtype: int64
_______________
   column1 column2    c
0        1     One  0.0
1        2     Two  2.0
2        3   Three  0.0
3        4    Four  4.0
4        5    Five  0.0
5        6     Six  6.0
6        7   Seven  0.0
7        8     NaN  8.0
8        9     NaN  0.0
<class 'pandas.core.series.Series'>
_______________
0     1
1     4
2     9
3    16
4    25
5    36
6    49
7    64
8    81
dtype: int64


### Changing Data Types
Changing data types is crucial for analysis. For example, dates are better represented as datetime objects than as strings.
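
To illustrate why datetime objects are more convenient than strings, here is a short sketch. The dates and the reference date are made up, and the `.dt` accessor is what exposes the date-specific operations.

```python
import pandas as pd

# Dates stored as plain strings.
joined = pd.Series(["2018-03-01", "2020-07-15", "2015-06-23"])

# Convert the strings to datetime values.
joined_dt = pd.to_datetime(joined)

# The .dt accessor exposes date parts such as the year.
years = joined_dt.dt.year

# Date arithmetic: days elapsed up to a hypothetical reference date.
reference = pd.Timestamp("2021-01-01")
tenure_days = (reference - joined_dt).dt.days

print(joined_dt.dtype)
print(years.tolist())
print(tenure_days.tolist())
```

None of this works on the raw strings: subtracting a string from a timestamp raises an error, and extracting the year would need string slicing or a regex.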


#### Convert Columns to Correct Data Types:
An example with date type


In [None]:
# Convert a column to datetime
data_1 = data.drop_duplicates() #dropping duplicates
data_1['Date_Joined'] = pd.to_datetime(data_1['Date_Joined'])
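# datetime objects allow you to call date-specific functions on them, such as computing the difference (delta) between two dates.
# Convert a numerical column to integer (NaN cannot be stored in an int column, so fill the missing values first)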
data_1 = data_1.fillna({'Salary': 0})
data_1['Salary'] = data_1['Salary'].astype('int')

#Check Data Types:
print(data_1.dtypes)


Employee_ID              int64
First_Name              object
Last_Name               object
Gender                  object
Age                    float64
Department              object
Job_Title               object
Salary                   int64
Date_Joined     datetime64[ns]
Bonus                  float64
Phone_Number            object
Email                   object
dtype: object


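Note: this cell also emits a pandas SettingWithCopyWarning, because data_1 was derived from data and pandas cannot tell whether the assignment is meant to affect the original dataframe:

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_1['Date_Joined'] = pd.to_datetime(data_1['Date_Joined'])

Making an explicit copy first, for example data_1 = data.drop_duplicates().copy(), avoids the warning.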


### String Manipulation
Strings need to be cleaned or transformed for consistency.


In [None]:
#Example 1: Stripping Whitespace
data_1['First_Name'] = data_1['First_Name'].str.strip()  # Remove leading/trailing whitespace (applies to the entire column)
print(data_1.dtypes)
#Example 2: Extracting Patterns with Regex
#creating a new column
import re
data_1['year'] = data_1['Date_Joined'].astype(str).str.extract(r'(\d{4})')  # Extract the year from the date string; since the column is already a datetime, data_1['Date_Joined'].dt.year also works
#Example 3: Replacing Values
#data_1['Gender'] = data_1['Gender'].replace({'male': 'M','Male': 'M', 'female': 'F','Female': 'F'})
# The patterns are anchored with ^ and $ because an unanchored [mM]ale also matches inside "female"
data_1['Gender'] = data_1['Gender'].replace({re.compile(r'^[mM]ale$'): 'M',re.compile(r'^[fF]emale$'): 'F'})
print(data_1)




Employee_ID              int64
First_Name              object
Last_Name               object
Gender                  object
Age                    float64
Department              object
Job_Title               object
Salary                   int64
Date_Joined     datetime64[ns]
Bonus                  float64
Phone_Number            object
Email                   object
dtype: object
   Employee_ID First_Name Last_Name  Gender   Age Department  \
0          101       John       Doe    Male  35.0         HR   
1          102       Jane     SMITH  female  28.0         IT   
2          103        Sam     Brown    Male  45.0    Finance   
3          104      Emily     Davis  Female  30.0         IT   
4          105    Michael   Johnson    Male   NaN         HR   
5          106      Laura    Wilson  Female  40.0  Marketing   
6          107     Daniel     Moore    male  33.0  Marketing   
7          108        NaN     Black    male  27.0         HR   
8          109   Patricia     Green  F

In [None]:
#can tackle part a and b in hw


## Merging Datasets:
Merging datasets is useful when combining data from different sources or tables.

In [None]:
print(data)

newdata=pd.read_csv("messy_employee_data2.csv")
print('-----------------------------------------------------------------------------')

print(newdata)

#merged_data = pd.merge(data, newdata, how='outer')

#you can use the "on" parameter to merge on only a few of the common columns
merged_data = pd.merge(data, newdata,on="Employee_ID", how='outer') #'outer' acts like an OR: it keeps all rows from both dataframes. 'inner' acts like an AND: it keeps only rows whose 'on' key appears in both.
print(merged_data)
#merged_data=merged_data.drop_duplicates()
#merged_data['Gender'] = merged_data['Gender'].replace({'male': 'M','Male': 'M', 'female': 'F','Female': 'F'})

   Employee_ID First_Name Last_Name  Gender   Age Department  \
0          101       John       Doe    Male  35.0         HR   
1          102       Jane     SMITH  female  28.0         IT   
2          103        Sam     Brown    Male  45.0    Finance   
3          104      Emily     Davis  Female  30.0         IT   
4          105    Michael   Johnson    Male   NaN         HR   
5          106      Laura    Wilson  Female  40.0  Marketing   
6          107     Daniel     Moore    male  33.0  Marketing   
7          108        NaN     Black    male  27.0         HR   
8          109   Patricia     Green  Female  55.0         HR   
9          101       John       Doe    Male  35.0         HR   

            Job_Title   Salary Date_Joined    Bonus Phone_Number  \
0          HR Manager  65000.0  2018-03-01   5000.0     555-1234   
1   Software Engineer  85000.0  2020-07-15   7000.0     555-5678   
2          Accountant  90000.0  2015-06-23   5500.0     555-2345   
3      Data Scientist  

* Inner Merge:
Returns only rows where the merge key (column or index) exists in both dataframes. This is the default merge type.
* Outer Merge:
Returns all rows from both dataframes. If a row exists in one dataframe but not the other, missing values (NaN) are added.
* Left Merge:
Returns all rows from the "left" dataframe and matching rows from the "right" dataframe. Non-matching rows from the right dataframe result in NaN values.
* Right Merge:
Returns all rows from the "right" dataframe and matching rows from the "left" dataframe. Non-matching rows from the left dataframe result in NaN values.
_______________

* The on parameter specifies the column(s) to merge on. If column names differ, left_on and right_on can be used. Merging on index is possible using left_index=True or right_index=True.
* Without specifying the on argument, the merge() function automatically looks for columns with the same names in both DataFrames. If it finds one or more columns with matching names, it uses those columns as the join keys. This behavior is equivalent to explicitly specifying the on argument with the common column names. If no common columns are found, and neither left_index nor right_index is set to True, the function will raise a MergeError.
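
The less common options described above (left_on/right_on, merging on the index, and the MergeError case) can be sketched as follows. The two dataframes and their column names are made up for this example.

```python
import pandas as pd

# Hypothetical tables whose key columns have different names.
employees = pd.DataFrame({"emp_id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]})
salaries = pd.DataFrame({"employee": [2, 3, 4], "salary": [50, 60, 70]})

# Different key names: use left_on and right_on (inner merge by default).
by_columns = pd.merge(employees, salaries, left_on="emp_id", right_on="employee")

# Merging on the index: move the keys into the index first.
by_index = pd.merge(
    employees.set_index("emp_id"),
    salaries.set_index("employee"),
    left_index=True,
    right_index=True,
)

# No common column names and no keys given: pandas raises MergeError.
try:
    pd.merge(employees, salaries)
    raised = False
except pd.errors.MergeError:
    raised = True

print(by_columns)
print(by_index)
print("MergeError raised:", raised)
```

Note that with left_on/right_on both key columns are kept in the result, so you may want to drop one of them afterwards.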

In [None]:
import pandas as pd

# Sample DataFrames
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Amith', 'Kamath', 'Belman']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Type': ["First", "Last1", "Last2"]})

print(df1)
print(df2)
print('-----------------------------------------------------------------------------------')
# Inner Merge (default)
inner_merged = pd.merge(df1, df2, on='ID')
print("inner_merged")
print(inner_merged)
print('-----------------------------------------------------------------------------------')
# Outer Merge
outer_merged = pd.merge(df1, df2, on='ID', how='outer')
print("outer_merged")
print(outer_merged)
print('-----------------------------------------------------------------------------------')
# Left Merge
left_merged = pd.merge(df1, df2, on='ID', how='left')
print("left_merged")
print(left_merged)
print('-----------------------------------------------------------------------------------')
# Right Merge
right_merged = pd.merge(df1, df2, on='ID', how='right')
print("right_merged")
print(right_merged)

   ID    Name
0   1   Amith
1   2  Kamath
2   3  Belman
   ID   Type
0   2  First
1   3  Last1
2   4  Last2
-----------------------------------------------------------------------------------
inner_merged
   ID    Name   Type
0   2  Kamath  First
1   3  Belman  Last1
-----------------------------------------------------------------------------------
outer_merged
   ID    Name   Type
0   1   Amith    NaN
1   2  Kamath  First
2   3  Belman  Last1
3   4     NaN  Last2
-----------------------------------------------------------------------------------
left_merged
   ID    Name   Type
0   1   Amith    NaN
1   2  Kamath  First
2   3  Belman  Last1
-----------------------------------------------------------------------------------
right_merged
   ID    Name   Type
0   2  Kamath  First
1   3  Belman  Last1
2   4     NaN  Last2



### Dataframe Transformations

#### pivot and pivot_table

* The pivot() function in the Pandas library is used to reshape a DataFrame by converting unique values from one column into new columns. This is useful for transforming data from a "long" format to a "wide" format.
* basic syntax: df.pivot(index, columns, values)

* If the index/columns combinations contain duplicates, pivot() raises an error. Use pivot_table instead, which aggregates the duplicate entries.
* basic syntax: df.pivot_table(index, columns, values, aggfunc optional, fill_value optional)

- values: The column to aggregate.
- index: The column to group by on the pivot table index (rows).
- columns: The column to group by on the pivot table columns.
- aggfunc: The aggregation function to use (e.g., sum, mean, count). Defaults to mean.
- fill_value: Value to replace missing values with. Defaults to None, which leaves them as NaN.
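
The difference between pivot and pivot_table can be sketched with a small made-up sales table: pivot reshapes clean long data, fails on duplicate entries, and pivot_table resolves them by aggregating.

```python
import pandas as pd

# Hypothetical "long" data: one row per (city, year) pair.
long_df = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Lima", "Lima"],
    "year": [2020, 2021, 2020, 2021],
    "sales": [10, 12, 7, 9],
})

# Long to wide: one row per city, one column per year.
wide = long_df.pivot(index="city", columns="year", values="sales")

# Messier data: (Oslo, 2021) appears twice and (Lima, 2021) is missing.
messy_df = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Oslo", "Lima"],
    "year": [2020, 2021, 2021, 2020],
    "sales": [10, 12, 20, 7],
})

# pivot cannot decide which duplicate to keep, so it raises ValueError.
try:
    messy_df.pivot(index="city", columns="year", values="sales")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table averages the duplicates and fills the missing cell with 0.
table = messy_df.pivot_table(
    index="city", columns="year", values="sales", aggfunc="mean", fill_value=0
)

print(wide)
print("pivot failed:", pivot_failed)
print(table)
```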

#### groupby
* The groupby() method in pandas groups the rows of a DataFrame based on the values in one or more columns. You can then apply aggregate functions to each group, such as the sum, mean, size, or count.
* basic syntax: df.groupby(\[columns\])\[selection column\].agg(\[aggregation function\])
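
Following the syntax above, several aggregation functions can be requested in one call by listing them inside agg. A minimal sketch with made-up data:

```python
import pandas as pd

# Hypothetical scores for two teams.
scores = pd.DataFrame({
    "team": ["a", "a", "b", "b", "b"],
    "score": [1, 3, 2, 4, 6],
})

# One row per team, one column per aggregation function.
stats = scores.groupby(["team"])["score"].agg(["mean", "count", "max"])

print(stats)
```

Each name in the list becomes a column of the result, so there is no need to run a separate groupby for every statistic.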


In [2]:
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'bar',
                         'foo', 'foo', 'bar', 'bar'],
                    'B': ['one', 'one', 'one', 'two',
                         'two', 'two', 'one', 'two'],
                    'C': [1, 2, 3, 4, 5, 6, 7, 8],
                    'D': [9, 10, 11, 12, 13, 14, 15, 16]})
print("Original data")
print(df)
# Assume I want to compute the average C value per value of 'A'.
# We can use groupby for this.
print('Groupby example')
print(df.groupby(['A'])['C'].agg(['mean']))
print("__________________")



Original data
     A    B  C   D
0  foo  one  1   9
1  foo  one  2  10
2  bar  one  3  11
3  bar  two  4  12
4  foo  two  5  13
5  foo  two  6  14
6  bar  one  7  15
7  bar  two  8  16
Groupby example
     mean
A        
bar   5.5
foo   3.5
__________________


In [3]:
# What if I want to compute the average of C per value of A per value of B?
print("__________________")
print("with groupby")
# To compute several statistics at once, list them all inside agg, e.g. agg(['mean', 'count'])
print(df.groupby(['A','B'])['C'].agg(['mean']))


#pivot table is useful when we are using two or more categorical values like above.

#what if I want to compute average of C value per value of A per value of B
print("__________________")
print("with pivot_table")
pivoted_df = df.pivot_table(index='A', columns='B', values='C',aggfunc='mean')
print(pivoted_df)
#you can use aggfunc parameter to set aggregation function to other things, like min, max and so on.


__________________
with groupby
         mean
A   B        
bar one   5.0
    two   6.0
foo one   1.5
    two   5.5
__________________
with pivot_table
B    one  two
A            
bar  5.0  6.0
foo  1.5  5.5


In [None]:

#hands-on activity

import seaborn as sns
tipsdata=sns.load_dataset('tips')

print(tipsdata.shape)
print(tipsdata.columns)
print(tipsdata.head())

print("______________")
# How much did people tip on average per day of week per time of day?
mean_day = tipsdata.groupby(['day', 'time'])['tip'].agg(['mean'])
print(mean_day)


print("______________")

# What is the average party size per time of day (lunch and dinner)? What are the max and min party sizes?
avg_size = tipsdata.groupby(['time'])['size'].agg(['mean'])
print(avg_size)
print("______________")
min_size = tipsdata.groupby(['time'])['size'].agg(['min'])
print(min_size)
print("______________")
max_size = tipsdata.groupby(['time'])['size'].agg(['max'])
print(max_size)
print("______________")

# How much do people tip (average, max, min) per gender per smoker type per size? Do this with both groupby and pivot table.
# Only the mean is computed here; pass ['mean', 'max', 'min'] to agg to get all three.
print(tipsdata.groupby(['sex','smoker','size'])['tip'].agg(['mean']))
print("______________")
pivoted_tips=tipsdata.pivot_table(index='smoker',columns=['sex', 'size'], values='tip',aggfunc=['mean', 'max', 'min'])
print(pivoted_tips)




(244, 7)
Index(['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'size'], dtype='object')
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4
______________
                 mean
day  time            
Thur Lunch   2.767705
     Dinner  3.000000
Fri  Lunch   2.382857
     Dinner  2.940000
Sat  Lunch        NaN
     Dinner  2.993103
Sun  Lunch        NaN
     Dinner  3.255132
______________
            mean
time            
Lunch   2.411765
Dinner  2.630682
______________
        min
time       
Lunch     1
Dinner    1
______________
        max
time       
Lunch     6
Dinner    6
______________
                        mean
sex    smoker size          
Male   Yes    1     1.920000
              2     2.692927
       
