## Introduction
![](https://upload.wikimedia.org/wikipedia/commons/thumb/6/66/Chamber_of_Deputies_of_Brazil_2.jpg/272px-Chamber_of_Deputies_of_Brazil_2.jpg)


## Exploratory Analysis

### Imports

<p><font size="3" color="Blue">    

> We are using a typical data science stack: `numpy`, `pandas`, `sklearn`, `matplotlib`. 
    
</font></p>

In [None]:
from mpl_toolkits.mplot3d import Axes3D
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt # plotting
import numpy as np # linear algebra
import os # accessing directory structure
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)


In [None]:
# Suppress warnings 
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

from IPython.display import HTML





# Thinking Brazil - Public Spending

In [None]:
HTML('<iframe width="880" height="520" src="https://www.youtube.com/embed/as5_mTfDEw8" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')

<p><font size="3" color="Blue">    
There is 1 csv file in the current version of the dataset:
    
</font></p>

In [None]:
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))



<p><font size="3" color="Blue">    
The next code cells define helper functions for plotting the data.
</font></p>

In [None]:
# Distribution graphs (histogram/bar graph) of column data
def plotPerColumnDistribution(df, nGraphShown, nGraphPerRow):
    nunique = df.nunique()
    df = df[[col for col in df if nunique[col] > 1 and nunique[col] < 50]] # For displaying purposes, pick columns that have between 1 and 50 unique values
    nRow, nCol = df.shape
    columnNames = list(df)
    nGraphRow = (nCol + nGraphPerRow - 1) // nGraphPerRow # ceiling division; plt.subplot needs an integer row count
    plt.figure(num = None, figsize = (6 * nGraphPerRow, 8 * nGraphRow), dpi = 80, facecolor = 'w', edgecolor = 'k')
    for i in range(min(nCol, nGraphShown)):
        plt.subplot(nGraphRow, nGraphPerRow, i + 1)
        columnDf = df.iloc[:, i]
        if (not np.issubdtype(type(columnDf.iloc[0]), np.number)):
            valueCounts = columnDf.value_counts()
            valueCounts.plot.bar()
        else:
            columnDf.hist()
        plt.ylabel('counts')
        plt.xticks(rotation = 90)
        plt.title(f'{columnNames[i]} (column {i})')
    plt.tight_layout(pad = 1.0, w_pad = 1.0, h_pad = 1.0)
    plt.show()


In [None]:
# Correlation matrix
def plotCorrelationMatrix(df, graphWidth):
    filename = df.dataframeName
    df = df.dropna(axis='columns') # drop columns with NaN (axis must be passed by keyword in recent pandas)
    df = df.select_dtypes(include=[np.number]) # correlation is only defined for numeric columns
    df = df[[col for col in df if df[col].nunique() > 1]] # keep columns with more than 1 unique value
    if df.shape[1] < 2:
        print(f'No correlation plots shown: The number of non-NaN, non-constant numeric columns ({df.shape[1]}) is less than 2')
        return
    corr = df.corr()
    plt.figure(num=None, figsize=(graphWidth, graphWidth), dpi=80, facecolor='w', edgecolor='k')
    corrMat = plt.matshow(corr, fignum = 1)
    plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
    plt.yticks(range(len(corr.columns)), corr.columns)
    plt.gca().xaxis.tick_bottom()
    plt.colorbar(corrMat)
    plt.title(f'Correlation Matrix for {filename}', fontsize=15)
    plt.show()


In [None]:
# Scatter and density plots
def plotScatterMatrix(df, plotSize, textSize):
    df = df.select_dtypes(include =[np.number]) # keep only numerical columns
    # Remove rows and columns that would lead to df being singular
    df = df.dropna(axis='columns')
    df = df[[col for col in df if df[col].nunique() > 1]] # keep columns where there are more than 1 unique values
    columnNames = list(df)
    if len(columnNames) > 10: # reduce the number of columns for matrix inversion of kernel density plots
        columnNames = columnNames[:10]
    df = df[columnNames]
    ax = pd.plotting.scatter_matrix(df, alpha=0.75, figsize=[plotSize, plotSize], diagonal='kde')
    corrs = df.corr().values
    for i, j in zip(*np.triu_indices_from(ax, k = 1)): # upper-triangle cells of the scatter matrix
        ax[i, j].annotate('Corr. coef = %.3f' % corrs[i, j], (0.8, 0.2), xycoords='axes fraction', ha='center', va='center', size=textSize)
    plt.suptitle('Scatter and Density Plot')
    plt.show()



<p><font size="3" color="Blue">    
Now you're ready to read in the data and use the plotting functions to visualize the data.

</font></p>

### Let's check the first file: /kaggle/input/2019_OrcamentoDespesa.csv

In [None]:
nRowsRead = 1000 # specify None to read the whole file

# 2019_OrcamentoDespesa.csv may have more rows in reality, but we are only loading/previewing the first 1000 rows
df1 = pd.read_csv('/kaggle/input/2019_OrcamentoDespesa.csv', delimiter=";", encoding="ISO-8859-9", nrows=nRowsRead)
df1.dataframeName = '2019_OrcamentoDespesa.csv'
nRow, nCol = df1.shape
print(f'There are {nRow} rows and {nCol} columns')


<p><font size="3" color="Blue">    
Let's take a quick look at what the data looks like:


</font></p>

In [None]:
df1.head(5)

# Column Types

In [None]:
df1.dtypes

In [None]:
df1.dtypes.value_counts()

# Examine Missing Values
> Next we can look at the number and percentage of missing values in each column.


In [None]:
total = df1.isnull().sum().sort_values(ascending = False)
percent = (df1.isnull().sum()/df1.isnull().count()*100).sort_values(ascending = False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent']) # one row per column of df1
missing_data.head(10)


<p><font size="3" color="Blue">    
Distribution graphs (histogram/bar graph) of sampled columns:

</font></p>

In [None]:
plotPerColumnDistribution(df1, 10, 5)


<p><font size="3" color="Blue">    
 Correlation matrix:

</font></p>

Let's continue with the EDA. One way to try and understand the data is by looking for correlations between the numeric columns. We can calculate the Pearson correlation coefficient between every pair of numeric columns using the `.corr` DataFrame method.
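
As a small illustration of what `.corr` returns, here is a sketch on made-up toy data (the column names `x`, `y`, `z` and their values are assumptions for the example, not part of the spending dataset). It also recomputes one coefficient by hand as a cross-check.

```python
import numpy as np
import pandas as pd

# Made-up toy data (not from the dataset): y is an exact linear function of x,
# and z moves in the opposite direction of x.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 4.0, 6.0, 8.0, 10.0],
    "z": [5.0, 4.0, 3.0, 2.0, 1.0],
})

# Pairwise Pearson correlation between all columns
toy_corr = toy.corr(method="pearson")

# The same coefficient computed by hand for one pair:
# sum of co-deviations divided by the root of the product of squared deviations
x, y = toy["x"], toy["y"]
manual_xy = ((x - x.mean()) * (y - y.mean())).sum() / np.sqrt(
    ((x - x.mean()) ** 2).sum() * ((y - y.mean()) ** 2).sum()
)

print(toy_corr.round(3))
print(f"manual corr(x, y) = {manual_xy:.3f}")
```

In this toy case `x` and `y` are perfectly positively correlated and `x` and `z` perfectly negatively correlated, so the matrix contains only values of 1 and -1.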

The correlation coefficient is not the best measure of a feature's "relevance", but it does give us an idea of possible relationships within the data. Some general interpretations of the absolute value of the correlation coefficient are:

> .00-.19 “very weak”

> .20-.39 “weak”

> .40-.59 “moderate”

> .60-.79 “strong”

> .80-1.0 “very strong”
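
The bands above can be turned into a small lookup, sketched below. The helper name `correlation_strength` is hypothetical, and the cut-offs at 0.20, 0.40, 0.60 and 0.80 are an assumption about how to treat values that fall between the listed two-decimal ranges (for example 0.195).

```python
import numpy as np
import pandas as pd

def correlation_strength(r):
    """Map a correlation coefficient to a verbal label (hypothetical helper)."""
    a = abs(r)  # only the magnitude matters for strength
    if np.isnan(a):
        return "undefined"
    if a < 0.20:
        return "very weak"
    if a < 0.40:
        return "weak"
    if a < 0.60:
        return "moderate"
    if a < 0.80:
        return "strong"
    return "very strong"

# Made-up coefficients, for illustration only
example = pd.Series([0.05, -0.35, 0.5, -0.72, 0.93])
labels = example.map(correlation_strength)
print(pd.DataFrame({"r": example, "strength": labels}))
```

The same function could be applied to a full correlation matrix, for instance with `DataFrame.apply(lambda col: col.map(correlation_strength))`.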

In [None]:
plotCorrelationMatrix(df1, 8)


<p><font size="3" color="Blue">    
Scatter and density plots:

</font></p>

In [None]:

plotScatterMatrix(df1, 20, 10)

# Final