<h1>1. Introduction</h1>
<p>
Hi, I'm a novice at Kaggle, and this is my first Kaggle exercise. Please leave any general comments or advice below; it will help me a lot. In my code below I have noted some questions that I was unable to figure out. If you have any brilliant (or obvious) answers, please share them with everyone. I am interested in <b>environmental</b> topics such as <i>climate change, water management, air pollution, environmental crime</i>, etc. It would therefore be great if you could share other interesting Kaggle items.
</p>
<p>
Okay, it is now time. I'll start by importing packages.
</p>

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from IPython.display import display ## for multiple display
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
#from matplotlib import style 
#plt.style.use("fivethirtyeight")
#print(plt.style.available)

<h1>2. Skim through data</h1>
<p>
Take a quick look at the data before we begin.
</p>

In [None]:
## import the file and display the head
wetlands = pd.read_csv("../input/wetlands.csv")
wetlands.head()

<p>We have 7 columns (attributes) in total:</p>

<blockquote>
OBJECTID, ATTRIBUTE, WETLAND_TYPE, ACRES, GLOBALID, ShapeSTArea, ShapeSTLength
</blockquote>
<p>At first glance, two columns, <b>OBJECTID</b> and <b>GLOBALID</b>, do not look useful (you may want to check more rows than just what appears with <code>head()</code>). We will drop these columns in the next cell. Both <b>ATTRIBUTE</b> and <b>WETLAND_TYPE</b> seem to carry information on the type of wetland, I guess. However, I don't know the difference between the two. Let's look at it later below. <b>ACRES</b> obviously indicates the size (my hands are already googling how to convert it to square meters...). <b>ShapeSTArea</b> and <b>ShapeSTLength</b> seem to be terms used by GIS people. I guess they are numbers measured from satellite images.</p>

<p>One immediate question is what the relationship is between <b>ACRES</b> and <b>ShapeSTArea</b>. I can't find proper information about how they are measured. Could it be that <b>ACRES</b> is the real size and <b>ShapeSTArea</b> is measured from an image? It is just my guess... Does anyone know? Let's see the relation later below.</p>

<p><code>info()</code> indicates that the dataframe has 670,117 entries and that only the ACRES column has missing values (3 nulls). I display them with <code>isnull()</code>.</p>

In [None]:
## Display the info about the table (info() prints directly and returns None)
wetlands.info()

## We won't need these columns: OBJECTID, GLOBALID for our exercise so drop them.
## The OBJECTID column name carries a leading '\ufeff', which is a byte order mark (BOM)
## left over from the file encoding. I found the exact name with wetlands.columns[0]
wetlands.drop(['\ufeffOBJECTID', 'GLOBALID'], axis=1, inplace=True) 

## There are 3 entries in ACRES without a value. Display them. We don't do anything with them for now
display(wetlands[wetlands['ACRES'].isnull()])
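
<p>
The odd <code>'\ufeffOBJECTID'</code> column name above looks like the effect of a UTF-8 byte order mark at the start of the file. The sketch below uses a small made-up CSV (not the real wetlands file) to show that decoding with <code>utf-8-sig</code> strips the mark. Whether plain <code>read_csv</code> keeps the mark depends on the pandas version, so treat this as an illustration rather than a diagnosis of this dataset.
</p>

```python
import io
import pandas as pd

# Hypothetical two-row file written with a UTF-8 byte order mark (BOM).
raw = "OBJECTID,ACRES\n1,2.5\n2,0.7\n".encode("utf-8-sig")

# Decoding as plain UTF-8 keeps the BOM as the first character of the header.
header_plain = raw.decode("utf-8").splitlines()[0]
print(repr(header_plain))

# 'utf-8-sig' strips the BOM, so the first column name comes out clean.
demo = pd.read_csv(io.BytesIO(raw), encoding="utf-8-sig")
first_col = demo.columns[0]
print(repr(first_col))
```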

<h1>3. Juggle with data to find relations</h1>
<p>Looking at the table, I realized that in a number of rows the first letters of <b>ATTRIBUTE</b> and <b>WETLAND_TYPE</b> match (for instance L2UBH in ATTRIBUTE and Lake in WETLAND_TYPE). Let's dig deeper to discover more.</p>
<p>I tried to categorize <b>ATTRIBUTE</b> and learned that there are 1,076 categories! How detailed is that...? I did some research on the internet about wetland categorization and ended up bumping into the National Wetlands Inventory of the US Government. Have a look at the document (link provided below). It explains all the secret codes you see in <b>ATTRIBUTE</b> (it is indeed quite a detailed categorization system).</p>
<p>
<a href="https://www.fws.gov/wetlands/documents/NWI_Wetlands_and_Deepwater_Map_Code_Diagram.pdf">  Wetland Map Code Diagram</a>
</p> 
<p>As per the code map, I grouped the <b>ATTRIBUTE</b> values into 3 categories, <b>L</b>, <b>P</b>, and <b>R</b>, and entered these values into a newly created column named <b>CATEGORY</b>.</p>
<p>Before comparing <b>CATEGORY</b> with <b>WETLAND_TYPE</b>, I first categorized the latter and got 6 categories. <i>Lake</i> of course matches <i>Lacustrine</i> (L category above). <i>Riverine</i> is <i>Riverine</i> (R category above). All three categories starting with Freshwater should fall into the category of <i>Palustrine</i>. I had never heard of Palustrine or Lacustrine (who has?), but as far as I understand, pond-like wetlands must be Palustrine. I saved this grouping in a new column called <b>WETLAND_TYPE_SIMPLE</b>. We leave <i>Other</i> as it is for now. Let's see how many entries we have in each category of both <b>CATEGORY</b> and <b>WETLAND_TYPE_SIMPLE</b>.</p>

In [None]:
## Group values of ATTRIBUTE into CATEGORY, and compare with WETLAND_TYPE
att = wetlands.ATTRIBUTE
print(att.nunique())  ## ATTRIBUTE has 1076 categories. That is too many to work with directly...

## Notice that all these codes start with one of the letters L, P, or R
wetlands['CATEGORY'] = att.str[0]  ## Create a new column CATEGORY with the first letter
cat = wetlands.CATEGORY
display(cat.groupby(cat).count())  ## It has now only 3 categories

## Let's compare with WETLAND_TYPE.
wtype = wetlands.WETLAND_TYPE
print(wtype.unique())  ## WETLAND_TYPE has 6 categories.

## Group all Freshwater categories into Palustrine, Lake into Lacustrine, Riverine into Riverine
wtype_dic = {'Freshwater Emergent Wetland': 'Palustrine',
             'Freshwater Forested/Shrub Wetland': 'Palustrine',
             'Freshwater Pond': 'Palustrine',
             'Lake': 'Lacustrine',
             'Other': 'Other',
             'Riverine': 'Riverine'}
## Create a new column named WETLAND_TYPE_SIMPLE
wetlands['WETLAND_TYPE_SIMPLE'] = wtype.map(wtype_dic)
wtype = wetlands['WETLAND_TYPE_SIMPLE']
display(wtype.groupby(wtype).count())

<p>
It is interesting... (or just as expected?). The counts for categories L and R match the counts for Lacustrine and Riverine under <b>WETLAND_TYPE_SIMPLE</b>. If you add up the counts for Palustrine and Other (649,111 + 12,826), the total is 661,937, which matches the count for category P. We can reasonably assume that entries with the value Other fall into the P category. For simplicity, from now on we can use <b>CATEGORY (L, P, R)</b> for the type of wetland instead of <b>WETLAND_TYPE</b> or <b>WETLAND_TYPE_SIMPLE</b>.
</p>
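<p>
Matching totals suggest the mapping but do not prove it row by row. A cross-tabulation would show directly which <b>WETLAND_TYPE</b> values occur under each first letter. Below is a minimal sketch on a tiny made-up sample (the attribute codes are only illustrative, and <code>demo</code> and <code>table</code> are names I introduce here); on the real data one would pass the corresponding <code>wetlands</code> columns instead.
</p>

```python
import pandas as pd

# Tiny made-up sample, not the real wetlands data.
demo = pd.DataFrame({
    "ATTRIBUTE": ["L2UBH", "PEM1C", "PFO1A", "R2UBH", "PUBHx", "Pf"],
    "WETLAND_TYPE": ["Lake", "Freshwater Emergent Wetland",
                     "Freshwater Forested/Shrub Wetland", "Riverine",
                     "Freshwater Pond", "Other"],
})

# First letter of the code, as in the CATEGORY column above.
demo["CATEGORY"] = demo["ATTRIBUTE"].str[0]

# Rows: CATEGORY, columns: WETLAND_TYPE, cells: number of entries.
table = pd.crosstab(demo["CATEGORY"], demo["WETLAND_TYPE"])
print(table)
```

<p>
If "Other" only ever shows up in the P row, the assumption above holds for every entry and not just in total.
</p>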

<p>
Now, let's look at the columns related to the size of wetlands. At first glance, <b>ACRES</b> and <b>ShapeSTArea</b> seem somewhat related. If my assumption is correct, <b>ShapeSTArea</b> should be measured in square meters, and the ratio of <b>ShapeSTArea</b> to <b>ACRES</b> should be (almost) constant. Keep in mind that </p>
<blockquote>
1 acre = 4046.86 m<sup>2</sup>
</blockquote>
<p>
First, let's see the mean and standard deviation of the ratio of <b>ShapeSTArea</b> to <b>ACRES</b>. Then I display some quantile points of its distribution so that we can roughly guess what the distribution looks like. For this, I used the median, both end points, and the boundaries of the central 90% and 95% of the data: <code>quantile([0, 0.025, 0.05, 0.5, 0.95, 0.975, 1])</code>. For this exercise I keep the central 95% of the data, which means the 2.5% of entries at each end (toward the minimum and maximum values) are treated as outliers. As a matter of fact, the standard deviation is really BIG, and when you sort the ratio values in descending order, the top few are on the order of 10 to the 8th power!!
</p>
<p>
To visualize how the values are distributed, I divided the range of the central 95% of the data into intervals of width 0.5 (in ratio value) and made a bar plot of how many entries belong to each interval, including the two outlier groups at the ends.
</p>
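<p>
The trimming and binning idea described above can be sketched on synthetic numbers before applying it to the real column. Everything in this block is made up for illustration: the ratios are random values near 4,044 plus three invented extreme values, and the variable names are my own.
</p>

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic ratios: most values close to 4044, plus three invented extremes.
ratio = pd.Series(np.concatenate([rng.normal(4044.0, 0.3, 1000),
                                  [1e8, 5e7, 0.001]]))
print("std before trimming:", ratio.std())

# Keep the central 95%: drop 2.5% at each end.
lbound, ubound = ratio.quantile([0.025, 0.975])
central = ratio[(ratio > lbound) & (ratio < ubound)]
print("std after trimming:", central.std())

# Bins of width 0.5 across the central range, with catch-all bins at both ends.
edges = np.arange(np.floor(central.min()), np.ceil(central.max()) + 0.5, 0.5)
bins = [0.0] + list(edges) + [np.inf]
counts = pd.cut(ratio, bins=bins).value_counts(sort=False)
print(counts)
```

<p>
The first and last bins collect everything outside the central range, so the extremes stay visible in the counts without stretching the axis.
</p>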

In [None]:
## At first glance, ACRES and ShapeSTArea seem quite related to each other.
## 1 acre = 4046.86 square meters
wetlands['AREA_RATIO'] = wetlands.ShapeSTArea / wetlands.ACRES
area_ratio = wetlands['AREA_RATIO'].dropna()

## Prepare a plot
fig, axis1 = plt.subplots(1,1,figsize=(5,5))

## Distribution of area ratio
print("mean: {}, std: {}".format(area_ratio.mean(), area_ratio.std()))
#area_ratio.sort_values(ascending=False)  ## We can see a few outlying values
display(area_ratio.quantile([0, 0.025, 0.05, 0.5, 0.95, 0.975, 1]))  ## Peek through distribution

## Take only the central 95% of values (i.e. drop outlying values)
lbound = area_ratio.quantile(0.025)
ubound = area_ratio.quantile(0.975)
ar = area_ratio[(area_ratio > lbound) & (area_ratio < ubound)]
ar_min = int(np.floor(ar.min()))  ## Minimum value in 95% group
ar_max = int(np.ceil(ar.max()))  ## Maximum value in 95% group
interval = [x / 2 for x in range(ar_min * 2, ar_max * 2 + 1)]  ## Steps of 0.5
ar_bins = [0] + interval + [float('inf')]  ## Prepare bins to display values
cut = pd.cut(area_ratio, bins=ar_bins)
area_ratio = pd.concat([area_ratio, cut], axis=1)
area_ratio.columns = ['AREA_RATIO', 'AREA_RATIO_GROUP']

## Plot 
l = sns.countplot(x='AREA_RATIO_GROUP', data=area_ratio, ax=axis1)
axis1.set_xticklabels(l.get_xticklabels(), rotation=60, ha='right')
axis1.set_title('Ratio: ShapeSTArea / ACRES', fontsize=14)
axis1.set_xlabel(axis1.get_xlabel(), size=12)
axis1.set_ylabel(axis1.get_ylabel(), size=12)
axis1.yaxis.set_major_formatter(mpl.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))
l.tick_params(labelsize=8)

## Remove rows with outlying values outside the central 95%
wetlands = wetlands[(wetlands['AREA_RATIO'] > lbound) & (wetlands['AREA_RATIO'] < ubound)]

<p>Interesting! More than half of the entries fall between 4,043.5 and 4,044.5. Apart from a small discrepancy (under 0.1%), this agrees with the expected value of 4,046.86 given above. As mentioned above, I'll keep only the central 95% of the data (i.e. remove the outlying values) for the next steps.
</p>
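<p>
The size of the gap between the observed ratio and the conversion factor can be checked with a line of arithmetic. This is only a sketch: the value 4044.0 is my assumed midpoint of the dominant bin, not a number computed from the data.
</p>

```python
# Conversion factor quoted earlier in this notebook (square meters per acre).
SQ_M_PER_ACRE = 4046.86

# Assumed midpoint of the dominant bin (4,043.5 to 4,044.5).
observed_ratio = 4044.0

# Relative difference between the expected and the observed ratio.
rel_diff = abs(SQ_M_PER_ACRE - observed_ratio) / SQ_M_PER_ACRE
print("relative discrepancy: {:.3%}".format(rel_diff))
```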

<h1>4. Circularity to approximate shape</h1>
<p>
We now have two columns left to relate: <b>ShapeSTArea</b> and <b>ShapeSTLength</b>. Although the data given are not sufficient to really visualize the shape of wetlands, we can at least approximate the shape using circularity, which can be calculated from these two columns. Circularity is defined as the degree to which a particle is similar to a circle, taking into account the smoothness of the perimeter (ISO 9276-6). The more circular a wetland's shape is, the closer its circularity is to 1. If a wetland tends to be linear, its circularity goes toward 0. Circularity can be calculated as follows:
</p>
<blockquote>
$$\text{Circularity} = \sqrt{\frac{4\pi \cdot \text{Area}}{\text{Perimeter}^{2}}}$$
</blockquote>
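<p>
To get a feel for the scale, here is a small sketch that evaluates the formula on three idealized shapes. The shapes and sizes are made up, and <code>circularity</code> is a helper name I introduce for illustration.
</p>

```python
import numpy as np

def circularity(area, perimeter):
    """sqrt(4 * pi * area / perimeter**2), the formula given above."""
    return np.sqrt(4 * np.pi * area / np.square(perimeter))

r = 10.0
circle = circularity(np.pi * r**2, 2 * np.pi * r)       # perfect circle, radius 10 m
square = circularity(10.0 * 10.0, 4 * 10.0)              # 10 m x 10 m square
strip = circularity(1000.0 * 5.0, 2 * (1000.0 + 5.0))    # 1 km x 5 m strip

print(circle, square, strip)
```

<p>
The circle gives exactly 1, the square gives sqrt(pi/4), roughly 0.89, and the long thin strip falls close to 0, which is the behavior described above.
</p>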
<p>
I'll first display the mean and standard deviation of circularity for each <b>CATEGORY (L, P, R)</b>, as well as some quantile values, to see how circularity is distributed in the 3 categories. Then I'll plot a bar chart with 10 brackets: (0, 0.1], (0.1, 0.2], (0.2, 0.3], ..., (0.9, 1.0]. It will show roughly what each type of wetland looks like. Intuitively speaking, pond-like P-type wetlands should tend to have greater circularity values than river-like R-type wetlands (rivers are usually more linear). Lakes might be in between. Let's see...
</p>

In [None]:
## Calculate circularity defined as above, and then split the data by each type (L, P, R) 
wlen = wetlands['ShapeSTLength']
warea = wetlands['ShapeSTArea']
wetlands['CIRCULARITY'] = np.sqrt(4*np.pi*warea/np.square(wlen))
circ_l = wetlands[wetlands['CATEGORY']=='L']['CIRCULARITY']
circ_p = wetlands[wetlands['CATEGORY']=='P']['CIRCULARITY']
circ_r = wetlands[wetlands['CATEGORY']=='R']['CIRCULARITY']

## Basic statistics information of each type (L, P, R) 
print('CIRCULARITY MEAN OF L-TYPE: {}, CIRCULARITY STD OF L-TYPE: {}'.format(np.mean(circ_l), np.std(circ_l)))
print('CIRCULARITY MEAN OF P-TYPE: {}, CIRCULARITY STD OF P-TYPE: {}'.format(np.mean(circ_p), np.std(circ_p)))
print('CIRCULARITY MEAN OF R-TYPE: {}, CIRCULARITY STD OF R-TYPE: {}'.format(np.mean(circ_r), np.std(circ_r)))
lq = circ_l.quantile([0, 0.025, 0.05, 0.5, 0.95, 0.975, 1])
pq = circ_p.quantile([0, 0.025, 0.05, 0.5, 0.95, 0.975, 1])
rq = circ_r.quantile([0, 0.025, 0.05, 0.5, 0.95, 0.975, 1])
circ_qt = pd.concat([lq, pq, rq], axis=1)
circ_qt.columns = ['CIRCULARITY_L', 'CIRCULARITY_P', 'CIRCULARITY_R']
circ_qt.index.rename('QUANTILE', inplace=True)
display(circ_qt)

## Plot circularity of each type using a bin [0.0, 0.1, 0.2, ... 0.9, 1.0]
fig, (axis1, axis2, axis3) = plt.subplots(1,3, figsize=(12,5))
bins = [x / 10 for x in range(11)]
cut_l = pd.cut(circ_l, bins=bins).to_frame()
cut_l.columns = ['CIRC_GROUP']
sns.countplot(x='CIRC_GROUP', data=cut_l, ax=axis1)
cut_p = pd.cut(circ_p, bins=bins).to_frame()
cut_p.columns = ['CIRC_GROUP']
sns.countplot(x='CIRC_GROUP', data=cut_p, ax=axis2)
cut_r = pd.cut(circ_r, bins=bins).to_frame()
cut_r.columns = ['CIRC_GROUP']
sns.countplot(x='CIRC_GROUP', data=cut_r, ax=axis3)

## Title, labels, ticks, annotations
dtype = ['L-type', 'P-type', 'R-type']
axis = [axis1, axis2, axis3]
total = [len(cut_l), len(cut_p), len(cut_r)]
for ax, name, n in zip(axis, dtype, total):
    ax.set_title(name, fontsize=14)
    ax.set_xlabel(ax.get_xlabel(), size=12)
    ax.set_ylabel(ax.get_ylabel(), size=12)
    ax.set_xticklabels(ax.get_xticklabels(), rotation=60, ha='right', size=8)
    ax.yaxis.set_major_formatter(mpl.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))
    ## Annotate each bar with its percentage of the group total
    for p in ax.patches:
        annt = 100*p.get_height()/n
        x = p.get_x()+p.get_width()/2.0
        y = p.get_height()+0.002*n
        ax.annotate('{:.1f}%'.format(annt), (x, y), ha='center', size=6)

axis1.set_ylim([0,1200])
axis3.set_ylim([0,350])
plt.tight_layout()
     

<p>
Okay, now it is really interesting! As the mean and quantile values of each type of wetland suggest, circularity tends to be larger for P-type and smaller for R-type (R &lt; L &lt; P). Look at the median values of P-type and R-type (0.805 and 0.284). More than half of the entries in category R have circularity values below 0.3, which means they are linear in shape. For P-type, more than 50% of wetlands have circularity values greater than 0.8, which means they are more or less circular. As guessed, lake-like wetlands are in between, with circularity values around 0.6-0.7, which can be interpreted as an elliptical shape. You can refer to the document below for a rough interpretation of circularity.
</p>
<a href="http://www.ivtnetwork.com/sites/default/files/ImageAnalysis_01.pdf">
Particle Shape Factors (see page 93, 94)
</a>
<p>
Now, let's look into it in more detail. Remember that when I categorized <b>WETLAND_TYPE</b>, I grouped together 3 types: "Freshwater Emergent Wetland", "Freshwater Forested/Shrub Wetland", and "Freshwater Pond". In my analysis above, category P also contains the wetland type "Other". I'll check whether these sub-types have properties different from the general P-type property.
</p>

In [None]:
## Divide P-type dataset into sub-groups as per WETLAND_TYPE
FE = wetlands[wetlands['WETLAND_TYPE']=='Freshwater Emergent Wetland']['CIRCULARITY']
FF = wetlands[wetlands['WETLAND_TYPE']=='Freshwater Forested/Shrub Wetland']['CIRCULARITY']
FP = wetlands[wetlands['WETLAND_TYPE']=='Freshwater Pond']['CIRCULARITY']
OT = wetlands[wetlands['WETLAND_TYPE']=='Other']['CIRCULARITY']

## Plot circularity of each type using a bin [0.0, 0.1, 0.2, ... 0.9, 1.0]
fig.clf()
fig, (axis1, axis2, axis3, axis4) = plt.subplots(1,4, figsize=(12,5))
bins = [x / 10 for x in range(11)]
cut_FE = pd.cut(FE, bins=bins).to_frame()
cut_FE.columns = ['CIRC_GROUP']
sns.countplot(x='CIRC_GROUP', data=cut_FE, ax=axis1)
cut_FF = pd.cut(FF, bins=bins).to_frame()
cut_FF.columns = ['CIRC_GROUP']
sns.countplot(x='CIRC_GROUP', data=cut_FF, ax=axis2)
cut_FP = pd.cut(FP, bins=bins).to_frame()
cut_FP.columns = ['CIRC_GROUP']
sns.countplot(x='CIRC_GROUP', data=cut_FP, ax=axis3)
cut_OT = pd.cut(OT, bins=bins).to_frame()
cut_OT.columns = ['CIRC_GROUP']
sns.countplot(x='CIRC_GROUP', data=cut_OT, ax=axis4)

## Title, labels, ticks
dtype = ['FE-type', 'FF-type', 'FP-type', 'OT-type']
axis = [axis1, axis2, axis3, axis4]
for ax, name in zip(axis, dtype):
    ax.set_title(name, fontsize=12)
    ax.set_xlabel(ax.get_xlabel(), size=8)
    ax.set_ylabel(ax.get_ylabel(), size=8)
    ax.set_xticklabels(ax.get_xticklabels(), rotation=60, ha='right', size=6)
    ax.yaxis.set_major_formatter(mpl.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))

plt.tight_layout()

<p>
Despite some minor differences, all four sub-groups have similar properties, and together they make up the general P-type wetland property that we saw in the previous graph. Therefore they can be kept together as one group of P-type wetlands.
</p>
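<p>
Besides comparing the bar charts by eye, the sub-groups could also be compared with a few summary numbers per <b>WETLAND_TYPE</b>. The sketch below shows the idea on invented circularity values; on the real data the same <code>groupby</code> call would be applied to the <code>wetlands</code> dataframe.
</p>

```python
import pandas as pd

# Invented circularity values, two entries per sub-type, for illustration only.
demo = pd.DataFrame({
    "WETLAND_TYPE": ["Freshwater Pond", "Freshwater Pond", "Other", "Other",
                     "Freshwater Emergent Wetland", "Freshwater Emergent Wetland"],
    "CIRCULARITY": [0.91, 0.85, 0.78, 0.82, 0.74, 0.80],
})

# Count, median, and mean of circularity for each sub-type.
summary = demo.groupby("WETLAND_TYPE")["CIRCULARITY"].agg(["count", "median", "mean"])
print(summary)
```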
<p>
Now, look at the R-type wetlands. Most of them are linear, as we discovered above. There are, however, some entries with quite high circularity values. Intuitively, it is somewhat odd for rivers to be circular... Let's look into it.
</p>

In [None]:
## Entries with category R
rtype = wetlands[wetlands['CATEGORY']=='R']
## Entries with category R and circularity greater than 0.8 (rivers, but rather circular in shape)
rtype08 = wetlands[(wetlands['CATEGORY']=='R') & (wetlands['CIRCULARITY']>0.8)]
print('R-type, for all circularity: mean ShapeSTArea = {:.2f}, mean ShapeSTLength = {:.2f}'
      .format(rtype['ShapeSTArea'].mean(), rtype['ShapeSTLength'].mean()))
print('R-type, circularity>0.8: mean ShapeSTArea = {:.2f}, mean ShapeSTLength = {:.2f}'
      .format(rtype08['ShapeSTArea'].mean(), rtype08['ShapeSTLength'].mean()))

<p>
Alright. Of course there are no rivers that are circular rather than linear. These wetlands (category R and circularity > 0.8) are rivers with an extremely short perimeter (mean value: 237 m) compared to the whole group of category R (mean value: 10.465 km). They should therefore be less than about 100 m long. We can infer that these extremely short river segments may have been picked up from the imagery as roughly circular or moderately elliptical shapes.
</p> 
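<p>
The "less than 100 m" estimate can be sanity-checked with a rough calculation. The sketch below assumes, purely for illustration, that such a river segment is a rectangle; real polygons are not rectangles, and 237 m is a group mean rather than the perimeter of any single entry. <code>longest_side</code> is a helper name I made up.
</p>

```python
import math

def longest_side(perimeter, circularity):
    """Longer side of a rectangle with the given perimeter and circularity."""
    half = perimeter / 2.0                                      # a + b
    product = circularity**2 * perimeter**2 / (4 * math.pi)     # a * b, from the formula
    disc = half**2 - 4 * product
    if disc < 0:
        # A rectangle cannot exceed the circularity of a square, sqrt(pi/4).
        raise ValueError("no rectangle has this circularity")
    return (half + math.sqrt(disc)) / 2.0

# Mean perimeter of the high-circularity R group, at the 0.8 threshold.
length = longest_side(237.0, 0.8)
print("longest side: {:.1f} m".format(length))
```

<p>
Under this rectangle assumption the longer side comes out a little under 90 m, which is in line with the estimate above.
</p>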

<h1>5. End note</h1>
<p>
The given dataset is far from sufficient to analyze and visualize the shapes of wetlands. Circularity is one property that approximates shape, but it is not the best option for fulfilling the objective. At least I can show that different types of wetlands have different circularity values, so they can be roughly interpreted as circular, elliptical, or linear. In line with common sense, my analysis indicates that pond-type wetlands are circular, lakes are generally elliptical, and rivers are usually linear.
</p>
<p>
More detailed analyses can be done, but I'll stop here. It was a simple exercise but fun to play with. It would have been better if more datasets and a more explicit explanation of the data had been provided. I'll participate in other exercises/competitions that are related to the environment. Please feel free to leave any comments, advice, or criticism. Lastly, I wish the other participants good luck.
</p>