# Generate API Coverage Documentation

This notebook generates API coverage documentation from two inputs: the
instrumentation files recorded when running snowpandas (either individual
notebooks or through pytest) and a pandas documentation reference,
`DocumentedPandasAPI.csv`.

## Generating the Results
- Run pytest with `--generate_pandas_api_coverage` to generate a set of `record-[TIMESTAMP].csv` files
- Change the `INSTRUMENTATION_FILES` variable below to point to these files (if needed)
- Ensure that `PANDA_DOCUMENTATION_MAP` points to the CSV file containing all known pandas API calls
- Run the notebook. The record files will be collected and
concatenated into a dataframe.
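
If the glob pattern matches no files, the concatenation step has nothing to combine and `pd.concat` raises an error. A minimal sketch of a pre-flight check, assuming a hypothetical helper named `find_record_files` that is not part of this notebook:

```python
import glob


def find_record_files(pattern):
    """Return the sorted record files matching `pattern`, or raise if none exist.

    `find_record_files` is a hypothetical helper for illustration only.
    """
    filenames = sorted(glob.glob(pattern))
    if not filenames:
        # Fail early with a message that names the pattern that was tried
        raise FileNotFoundError(f"No instrumentation files match {pattern!r}")
    return filenames
```

Something like `find_record_files(INSTRUMENTATION_FILES)` could stand in for the bare `glob.glob` call before the files are read.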

## Interpreting the Results

- "✅" means that the function was executed successfully at least once and never raised a NotImplementedError
- "🟡" means that the function was executed successfully at least once and at least one invocation raised a NotImplementedError
- "❌" means that the function was never executed, is not instrumented, or never completed without an error
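
The rule behind these three symbols can be sketched as a small pure-Python function. This is an illustration of the logic the notebook applies later with pandas, not the notebook's own code; `classify` and its input format (one exception name per invocation, `None` for success) are assumptions made for the example.

```python
def classify(exceptions):
    """Map the outcomes of every invocation of one API to a coverage symbol.

    `exceptions` holds one entry per invocation: None for a successful call,
    otherwise the exception class name as a string.
    """
    has_success = any(e is None for e in exceptions)
    has_not_implemented = any(e == "NotImplementedError" for e in exceptions)
    if has_success and has_not_implemented:
        return "🟡"
    if has_success:
        # Errors other than NotImplementedError do not demote the result
        return "✅"
    # Never executed, not instrumented, or every call failed
    return "❌"
```

Under this sketch, an API that succeeded once and raised a `ValueError` once is still reported as "✅", while an empty list of invocations is reported as "❌".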

Not all APIs are instrumented, and improvements to `PandasAPICoverageGenerator`
will be needed to ensure this API documentation is up to date.
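
One way to see which documented APIs have no instrumentation records at all is a set difference over `(class, method)` pairs. The sketch below uses toy data and a hypothetical helper, `unrecorded_apis`. It cannot tell an API that is not instrumented apart from one that was simply never called.

```python
def unrecorded_apis(documented, recorded):
    """Return documented (class, method) pairs that have no instrumentation record."""
    return sorted(set(documented) - set(recorded))


# Toy data for illustration only; these are not real instrumentation results.
documented = [("pandas", "isna"), ("pandas", "isnull"), ("Series", "str.len")]
recorded = [("pandas", "isna"), ("pandas", "isna")]

missing = unrecorded_apis(documented, recorded)
```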

In [28]:
INSTRUMENTATION_FILES="../../record-*.csv"
PANDA_DOCUMENTATION_MAP="DocumentedPandasAPI.csv"

In [29]:
import glob
import pandas as pd
import numpy as np
pd.options.mode.chained_assignment = None

In [30]:
# Collect a set of record files from instrumentation
# into a dataframe
filenames=glob.glob(INSTRUMENTATION_FILES)
dfs = []
for filename in filenames:
    dfs.append(pd.read_csv(filename, low_memory=True))
big_frame = pd.concat(dfs, ignore_index=True)
df = big_frame[["class", "method", "params", "exception", "start", "stop"]]


In [31]:
# Collect the documentation map into a dataframe
doc_df = pd.read_csv(PANDA_DOCUMENTATION_MAP)

In [32]:
df['class'].value_counts()

class
Index               4083681
Flags               1296588
MultiIndex          1293436
RangeIndex           544741
Timestamp            291569
Categorical          289171
DataFrame            205840
Series                28219
DatetimeIndex         25276
DataFrameGroupBy       9059
StringMethods          3406
modin.pandas           2151
Timedelta              1801
SeriesGroupBy          1758
CategoricalIndex        650
IntervalIndex           575
Period                  132
class                    17
PeriodIndex               4
TimedeltaIndex            3
Name: count, dtype: int64

In [33]:
# Map the snowpandas class names to the pandas class names
df = df[['class', 'method', 'exception']]
df['method'] = np.where(df['class'] == 'DatetimeProperties', "dt." + df['method'], df['method'])
df['class'] = np.where(df['class'] == 'DatetimeProperties', "Series", df['class'])
df['class'] = np.where(df['class'] == 'snowflake.snowpark.modin.pandas', "pandas", df['class'])
df['class'] = np.where(df['class'] == 'modin.pandas', "pandas", df['class'])
df['class'] = np.where(df['class'] == 'SnowparkPandasDataFrame', "DataFrame", df['class'])
df['class'] = np.where(df['class'] == 'SnowparkPandasSeries', "Series", df['class'])
df['method'] = np.where(df['class'] == 'StringMethods', "str." + df['method'], df['method'])
df['class'] = np.where(df['class'] == 'StringMethods', "Series", df['class'])
df['class'] = np.where(df['class'] == 'class', "other", df['class'])
df['class'].value_counts()

class
Index               4083681
Flags               1296588
MultiIndex          1293436
RangeIndex           544741
Timestamp            291569
Categorical          289171
DataFrame            205840
Series                31625
DatetimeIndex         25276
DataFrameGroupBy       9059
pandas                 2151
Timedelta              1801
SeriesGroupBy          1758
CategoricalIndex        650
IntervalIndex           575
Period                  132
other                    17
PeriodIndex               4
TimedeltaIndex            3
Name: count, dtype: int64

In [34]:
# Any call w/o an exception is "Supported"
df['coverage'] = df['exception'].fillna("Supported")
# Map Supported -> ✅
#     NotImplementedError -> ❌
#     All other errors -> ⚠
df['coverage'] = np.where(df['coverage'] == 'Supported', "✅",
                          np.where(df['coverage'] == 'NotImplementedError', '❌','⚠'))
supported_df = df[['class', 'method', 'coverage']]

In [35]:
doc_df.columns= doc_df.columns.str.lower()
result_df=supported_df.merge(doc_df, on=['class', 'method'], how="right")
result_df['group']=result_df['group'].fillna("pandas")
result_df = result_df.pivot_table(index=['class', 'group', 'method'], values=['milestone', 'coverage'], 
                                  aggfunc={'coverage':pd.Series.unique, 'milestone':pd.Series.max}, fill_value="")
result_df['coverage'] = result_df['coverage'].apply(lambda x: 
                               '🟡' if "✅" in x and "❌" in x
                               else '✅' if "✅" in x and "❌" not in x
                               else '❌')

result_df = result_df.round(4)
result_df

class,group,method,coverage,milestone
CategoricalIndex,Categorical components,add_categories,❌,
CategoricalIndex,Categorical components,as_ordered,❌,
CategoricalIndex,Categorical components,as_unordered,❌,
CategoricalIndex,Categorical components,categories,✅,
CategoricalIndex,Categorical components,codes,✅,
...,...,...,...,...
pandas,Top-level evaluation,eval,❌,
pandas,Top-level missing data,isna,✅,
pandas,Top-level missing data,isnull,❌,
pandas,Top-level missing data,notna,❌,


In [36]:
# Write out API documentation as Markdown
# (utf-8 is set explicitly because the coverage column contains emoji)
with open('Supported_API.md', 'w', encoding='utf-8') as fo:
    fo.write(result_df.reset_index().to_markdown(index=True, tablefmt="github"))
# Write out API documentation as CSV
with open('Supported_API.csv', 'w', encoding='utf-8') as fo:
    fo.write(result_df.reset_index().to_csv(index=True))

In [55]:
final_df = result_df
final_df = final_df.reset_index()
final_df = final_df[['coverage']]
final_df_count = final_df.value_counts()
#final_df = final_df_count.reset_index()
#final_df["count"] = final_df.loc[:,'proportion']
final_df_count

coverage
❌           624
✅           347
🟡            43
Name: count, dtype: int64

In [38]:
def report_perc(by):
    """Return the percentage of each coverage status, grouped by `by`."""
    final_df = result_df.reset_index()
    final_df = final_df[[by, 'coverage']].sort_values(by=by)
    # Normalized counts come back in a column named 'proportion' (pandas >= 2.0)
    final_df = final_df.groupby(by).value_counts(normalize=True).reset_index()
    final_df["count"] = final_df['proportion']
    final_df = final_df.pivot_table(index=[by], columns=['coverage'], values=["count"],
                                  aggfunc={'count':pd.Series.sum})

    return (final_df * 100).round(0).astype("str") + "%"

def report_num(by):
    """Return the number of APIs per coverage status, grouped by `by`."""
    final_df = result_df.reset_index()
    final_df = final_df[[by, 'coverage']].sort_values(by=by)
    # value_counts() already names its result column 'count' (pandas >= 2.0)
    final_df = final_df.groupby(by).value_counts().reset_index()
    final_df = final_df.pivot_table(index=[by], columns=['coverage'], values=["count"],
                                  aggfunc={'count':pd.Series.sum})

    return final_df

In [49]:
print("Snowpandas Supported API Percentage By Class: ")
report_perc("class")

Snowpandas Supported API Percentage By Class: 


class,✅,❌,🟡
CategoricalIndex,15.0%,85.0%,nan%
DataFrame,43.0%,48.0%,10.0%
DataFrameGroupBy,32.0%,56.0%,12.0%
DatetimeIndex,32.0%,68.0%,nan%
Expanding,nan%,100.0%,nan%
ExponentialMovingWindow,nan%,100.0%,nan%
Index,65.0%,35.0%,nan%
IntervalIndex,17.0%,83.0%,nan%
MultiIndex,64.0%,36.0%,nan%
PeriodIndex,nan%,100.0%,nan%


In [41]:
report_num("class")

class,✅,❌,🟡
CategoricalIndex,2.0,11.0,
DataFrame,94.0,105.0,21.0
DataFrameGroupBy,19.0,33.0,7.0
DatetimeIndex,15.0,32.0,
Expanding,,17.0,
ExponentialMovingWindow,,6.0,
Index,55.0,29.0,
IntervalIndex,3.0,15.0,
MultiIndex,18.0,10.0,
PeriodIndex,,25.0,
