# Objective :
To create a function called **graph** that performs an initial exploratory, univariate analysis of any dataset stored in a CSV file.

We will be plotting histograms, barplots and boxplots to understand our dataset !

#### Below are 3 ideas with output (includes 2 improvements)

#### Idea #1 :
Create a function that classifies each variable in the input dataset as numerical or categorical and plots a graph for every variable: a histogram and a boxplot for each numerical variable, and a bar graph for each categorical one.

Input from user : path (i.e., the directory) and file name

In [3]:
def graph():
    
    import os
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    sns.set()
    sns.set_color_codes()
    
    # Allow user to enter directory to fetch the file
    # (named `path` so that the built-in dir() is not shadowed)
    path = input("Enter the path : ")     # e.g : C:\\Users\\lenovo\\Desktop\\Praxis\\Practice_doc (without " ")
    os.chdir(path)
    
    # Allow user to enter the name of the file
    # user can call the function multiple times for different data
    file = input("Enter the file name : ")   # e.g cars.csv (without " ")
    df = pd.read_csv(file) 
        
    for i in df:
        # Check whether the variable is categorical or numerical 
        if df[i].dtype != 'object' and df[i].nunique() > 10:
            # If the datatype is not object (i.e., float, int) and the column has more than
            # 10 distinct values, then consider it as a numerical variable
            # For plotting a histogram
            plt.figure(figsize=(10,8))             # setting a size for the plot
            # histplot replaces distplot, which is deprecated since seaborn 0.11
            sns.histplot(x=df[i], kde=False, color="b")
            plt.ylabel("Count", fontsize=11)       # generic label, the data need not be cars
            plt.savefig(str(i)+"_histogram.png")   # to save the plot as an image  
            plt.close()

            # For plotting boxplot
            df.boxplot(column=i, grid=False, notch=False, vert=False, figsize=(10,8))
            plt.savefig(str(i)+"_boxplot.png")
            plt.close()

        else:
            # For plotting a bar graph
            sns.countplot(x=df[i], palette="Blues_d")
            #sns.barplot(data=df, orient="v", palette="Blues_d")
            plt.savefig(str(i)+"_bar_graph.png")
            plt.close()
    
    # Save the summary statistics (i.e., count, mean, min, max) of the data to a CSV file
    Describe = df.describe()   
    Describe.to_csv("Summary Statistics.csv")
    print("Please check the folder now !")

In [4]:
graph()

Enter the path : /users/rachita/Desktop/Python/Graph Function/
Enter the file name : cars.csv
Please check the folder now !
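
The numerical-versus-categorical decision described in Idea #1 can also be isolated and checked on its own, without reading a file or drawing any plots. The block below is a minimal sketch under stated assumptions: the helper name `split_columns`, its `min_unique` parameter and the toy data are hypothetical, and it uses `pd.api.types.is_numeric_dtype` in place of the `!= 'object'` comparison in the cell above.

```python
import pandas as pd


def split_columns(df, min_unique=10):
    """Return (numerical, categorical) lists of column names.

    A column is treated as numerical only if its dtype is numeric AND it
    has more than `min_unique` distinct values. Everything else (text
    columns, numeric codes with few levels) is treated as categorical.
    """
    numerical, categorical = [], []
    for name in df.columns:
        series = df[name]
        if pd.api.types.is_numeric_dtype(series) and series.nunique() > min_unique:
            numerical.append(name)
        else:
            categorical.append(name)
    return numerical, categorical


# Hypothetical toy data, not the cars.csv file used in this notebook
toy = pd.DataFrame({
    "mpg": [float(v) for v in range(20)],            # numeric, 20 distinct values
    "cylinders": [4, 6, 8, 4] * 5,                   # numeric, only 3 distinct values
    "origin": ["US", "Japan", "Europe", "US"] * 5,   # text
})

numerical, categorical = split_columns(toy)
print(numerical)     # ['mpg']
print(categorical)   # ['cylinders', 'origin']
```

With this rule a numeric column with few distinct levels, such as `cylinders` here, gets a bar graph instead of a histogram and a boxplot.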


#### Idea #2 : Improvement #1
If a user does not wish to plot all the graphs, we will give them that option !

To do so, we ask the user to enter the names of the columns they are interested in. The function then plots graphs for only those variables. If the user wishes to see the graphs for all variables, they can enter "All" in the input window.

Input from user : directory, file name, required column names

In [5]:
def graph():
    
    import os
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    sns.set()
    sns.set_color_codes()
    
    path = input("Enter the path : ")  
    os.chdir(path)
    
    file = input("Enter the file name : ")   
    df = pd.read_csv(file) 
    
    # Allow user to enter specific column name 
    print("Print all columns if column name not specified.")
    col = input("Enter column names separated by comma : ")
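    # Strip spaces around each name, so that "MPG, Origin" also works
    col1 = [c.strip() for c in col.split(",") if c.strip()]
    if not col1 or col1 == ["All"]:
        # Nothing specified (or "All") : use every column
        col = list(df.columns)
    else:
        col = col1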
        
    for i in col:
        if df[i].dtype != 'object' and df[i].nunique() > 10:
            # For plotting a histogram
            plt.figure(figsize=(10,8))             
            sns.histplot(x=df[i], kde=False, color="b")   # replaces the deprecated distplot
            plt.ylabel("Count", fontsize=11)
            plt.savefig(str(i)+"_histogram.png")
            plt.close()

            # For plotting boxplot
            df.boxplot(column=i, grid=False, notch=False, vert=False, figsize=(10,8))
            plt.savefig(str(i)+"_boxplot.png")
            plt.close()

        else:
            # For plotting a bar graph
            sns.countplot(x=df[i], palette="Blues_d")
            plt.savefig(str(i)+"_bar_graph.png")
            plt.close()
            
    Describe = df.describe()   
    Describe.to_csv("Summary Statistics.csv")
    
    print("Please check the folder now !")
    

In [6]:
graph()

Enter the path : /users/rachita/Desktop/Python/Graph Function/
Enter the file name : cars.csv
Print all columns if column name not specified.
Enter column names separated by comma : All
Please check the folder now !
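
A mistyped column name would otherwise only surface as a `KeyError` in the middle of the plotting loop, so it may help to validate the requested names before plotting. The following is a minimal sketch, assuming a hypothetical helper `parse_columns` and a hypothetical list of available column names. It treats both an empty input and "All" as "every column".

```python
def parse_columns(raw, available):
    """Turn a comma-separated string into a validated list of column names."""
    # Drop surrounding spaces and empty pieces such as a trailing comma
    names = [part.strip() for part in raw.split(",") if part.strip()]

    # Empty input or the keyword "All" selects every column
    if not names or names == ["All"]:
        return list(available)

    # Report every unknown name at once instead of failing on the first
    missing = [name for name in names if name not in available]
    if missing:
        raise ValueError("Unknown column(s): " + ", ".join(missing))
    return names


# Hypothetical column names, in practice this would be list(df.columns)
available = ["MPG", "Acceleration", "Origin"]

print(parse_columns("MPG, Origin", available))   # ['MPG', 'Origin']
print(parse_columns("All", available))           # ['MPG', 'Acceleration', 'Origin']
```

Raising a `ValueError` here is one possible design choice. Re-prompting the user until every name is valid would be an equally reasonable alternative.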


#### Idea #3 : Improvement #2

Problem : The user does not want the graphs of all variables saved in the main folder

- if there are many variables, all the graphs will be jumbled together and basically be a big mess !
- if the user runs the graph function on multiple datasets, the graphs for all of their variables pile up in the same place, a bigger mess !

To resolve this, we ask the user to enter a folder name for every dataset/CSV file, so that the graphs for different files are saved in different folders !

In [7]:
def graph():
    
    import os
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    sns.set()
    sns.set_color_codes()
    
    path = input("Enter the path : ")
    os.chdir(path)
    
    file = input("Enter the file name : ")
    df = pd.read_csv(file) 
    
    # Allow user to enter a folder name; the graphs are saved inside it
    folder1 = input("Enter folder name : ")
    folder = os.path.join(path, folder1)
    os.makedirs(folder, exist_ok=True)     # do not fail if the folder already exists
    os.chdir(folder)
    
    print("Print all columns if column name not specified. ")
    col = input("Enter column names separated by comma : ")
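    # Strip spaces around each name, so that "MPG, Origin" also works
    col1 = [c.strip() for c in col.split(",") if c.strip()]
    if not col1 or col1 == ["All"]:
        # Nothing specified (or "All") : use every column
        col = list(df.columns)
    else:
        col = col1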
        
    for i in col:
        if df[i].dtype != 'object' and df[i].nunique() > 10:
            # For plotting a histogram
            plt.figure(figsize=(10,8))             
            sns.histplot(x=df[i], kde=False, color="b")   # replaces the deprecated distplot
            plt.ylabel("Count", fontsize=11)
            plt.savefig(str(i)+"_histogram.png")
            plt.close()

            # For plotting boxplot
            df.boxplot(column=i, grid=False, notch=False, vert=False, figsize=(10,8))
            plt.savefig(str(i)+"_boxplot.png")
            plt.close()

        else:
            # For plotting a bar graph
            sns.countplot(x=df[i], palette="Blues_d")
            plt.savefig(str(i)+"_bar_graph.png")
            plt.close()
            
    Describe = df.describe()   
    Describe.to_csv("Summary Statistics.csv")
    
    print("Please check the folder now !")

In [8]:
graph()

Enter the path : /users/rachita/Desktop/Python/Graph Function/
Enter the file name : cars.csv
Enter folder name : Cars graph function
Print all columns if column name not specified. 
Enter column names separated by comma : MPG,Acceleration,Origin
Please check the folder now !
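
The per-dataset folder from Idea #3 can also be handled without changing the working directory, by building the full output path of each image explicitly. The block below is a minimal sketch using `pathlib`. The helper names `make_output_dir` and `plot_path` are hypothetical, and a temporary directory stands in for the user's real path.

```python
import tempfile
from pathlib import Path


def make_output_dir(base_dir, folder_name):
    """Create base_dir/folder_name if needed and return it as a Path."""
    out_dir = Path(base_dir) / folder_name
    # exist_ok=True lets the function be called again for the same folder
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir


def plot_path(out_dir, column, kind):
    """Build the image file path for one column, e.g. MPG_histogram.png."""
    return out_dir / f"{column}_{kind}.png"


# A temporary directory stands in for the path the user would type
with tempfile.TemporaryDirectory() as base:
    out = make_output_dir(base, "Cars graph function")
    make_output_dir(base, "Cars graph function")   # second call does not fail
    target = plot_path(out, "MPG", "histogram")
    created = out.is_dir()
    print(target.name)   # MPG_histogram.png
```

Passing such a path to `plt.savefig(target)` would save the image in the chosen folder while the current working directory stays untouched, so a later call for another dataset starts from the same place.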
